ABSTRACT

Many data sources are naturally modeled by multiple weight as- 
signments over a set of keys: snapshots of an evolving database 
at multiple points in time, measurements collected over multiple 
time periods, requests for resources served at multiple locations, 
and records with multiple numeric attributes. Over such vector- 
weighted data we are interested in aggregates with respect to one 
set of weights, such as weighted sums, and aggregates over multiple
sets of weights, such as the $L_1$ difference.

Sample-based summarization is highly effective for data sets that 
are too large to be stored or manipulated. The summary facilitates 
approximate processing of queries that may be specified after the
summary was generated. Current designs, however, are geared for data
sets where a single scalar weight is associated with each key. 

We develop a sampling framework based on coordinated weighted 
samples that is suited for multiple weight assignments and obtain 
estimators that are orders of magnitude tighter than previously pos- 
sible. We demonstrate the power of our methods through an ex- 
tensive empirical evaluation on diverse data sets ranging from IP 
network to stock quotes data. 

1. INTRODUCTION 

Many business-critical applications today are based on extensive 
use of computing and communication network resources. These 
systems are instrumented to collect a wide range of different types 
of data. Examples include performance or environmental measure- 
ments, traffic traces, routing updates, or SNMP traps in an IP net- 
work, and transaction logs, system resource (CPU, memory) usage 
statistics, service level end-end performance statistics in an end- 
service infrastructure. Retrieval of useful information from this 
vast amount of data is critical to a wide range of compelling appli- 
cations including network and service management, troubleshoot- 
ing and root cause analysis, capacity provisioning, security, and 
sales and marketing. 

Many of these data sources produce data sets consisting of nu- 
meric vectors (weight vectors) associated with a set of identifiers 
(keys) or equivalently as a set of weight assignments over keys. Ag- 
gregates over the data are specified using this abstraction. 

We distinguish between data sources with co-located or dispersed 
weights. A data source has dispersed weights if entries of the 
weight vector of each key occur in different times or locations: (i) 
Snapshots of a database that is modified over time (each snapshot 
is a weight assignment, where the weight of a key is the value of a 
numeric attribute in a record with this key.) (ii) measurements of 
a set of parameters (keys) in different time periods (weight assign- 
ments), (iii) number of requests for different objects (keys) pro- 
cessed at multiple servers (weight assignments). A data source has 
co-located weights when a complete weight vector is "attached"
to each key: (i) Records with multiple numeric attributes such as
IP flow records generated by a statistics module at an IP router, 
where the attributes are the number of bytes, number of packets, 
and unit, (ii) Document-term datasets, where keys are documents 
and weight attributes are terms or features (The weight value of a 
term in a document can be the respective number of occurrences), 
(iii) Market-basket datasets, where keys are baskets and weight at- 
tributes are goods (The weight value of a good in a basket can be its 
multiplicity), (iv) Multiple numeric functions over one (or more) 
numeric measurement of a parameter. For example, for measure- 
ment X we might be interested in both first and second moments, in 
which case we can use the weight assignments $x$ and $x^2$.

A very useful common type of query involves properties of a sub- 
population of the monitored data that are additive over keys. These 
aggregates can be broadly categorized as: (a) Single-assignment
aggregates, defined with respect to a single attribute, such as the
weighted sum or selectivity of a subpopulation of the keys. An example
over IP flow records is the total bytes of all IP traffic with
a certain destination Autonomous System [25, 1, 39, 16, 17]. (b)
Multiple-assignment aggregates, which include similarity or divergence
metrics such as the $L_1$ difference between two weight assignments or
the maximum/minimum weight over a subset of assignments [38, 22, 9, 21].
Figure 2(A) shows an example of three weight assignments
over a set of keys and key-wise values for multiple-assignment
aggregates including the minimum or maximum value of a key over
a subset of assignments and the $L_1$ distance. The aggregate value
over selected keys is the sum of key-wise values.

Multiple-assignment aggregates are used for clustering, change 
detection, and mining emerging patterns. Similarity over a corpus of
documents, according to a selected subset of features, can be used
to detect near-duplicates and reduce redundancy [41, 10, 52, 20, 37, 42].
A retail merchant may want to cluster locations according to
sales data for a certain type of merchandise. In IP networks, these
aggregates are used for monitoring, security, and planning
[28, 22, 23, 40]: an increase in the number of distinct flows on a certain port
might indicate worm activity, an increase in traffic to a certain set of
destinations might indicate a flash crowd or a DDoS attack, and 
an increased number of flows from a certain source may indicate 
scanner activity. A network security application might track the 
increase in traffic to a customer site that originates from a certain 
suspicious network or geographic area. 

Exact computation of such aggregates can be prohibitively resource- 
intensive: Data sets are often too large to be either stored for long 
time periods or to be collated across many locations. Computing 
multiple-assignment aggregates may require gleaning information 
across data sets from different times and locations. We therefore 
aim at concise summaries of the data sets, that can be computed in 
a scalable way and facilitate approximate query processing. 

Sample-based summaries [36, 56, 7, 6, 11, 31, 32, 3, 24, 33, 17, 26, 14, 18]
are more flexible than other formats: they naturally
facilitate subpopulation queries by focusing on sampled keys
that are members of the subpopulation and are suitable when the 
exact query of interest is not known beforehand or when there are 
multiple attributes of interest. Existing methods, however, are de- 
signed for one set of weights and are either not applicable or per- 
form poorly on multiple-assignment aggregates. 

Contributions 

We develop a sample-based summarization framework for vector-weighted
data that supports efficient approximate aggregations. The
challenges differ between the dispersed and co-located models due 
to the particular constraints imposed on scalable summarization. 

Dispersed weights model: A challenge is that any scalable algorithm
must decouple the processing of different assignments: collating
dispersed-weights data to obtain an explicit key/vector-weight
representation is too expensive. Hence, processing of one assignment
cannot depend on other assignments.

We propose summaries based on coordinated weighted samples. 
The summary contains a "classic" weighted sample taken with respect
to each assignment: we can tailor the sampling to be Poisson,
$k$-mins, or order (bottom-$k$) sampling. In all three cases, sampling
is efficient on data streams, distributed data, and metric data
[11, 15, 27, 16] and there are unbiased subpopulation weight estimators
that have variance that decreases linearly or faster with the sample 
size EUHlllSlini. Order samples ll?7ll5T1l48][T]1[T7ll45ll^, 
with the advantage of a fixed sample size, emerge as a better choice. 
Coordination loosely means that a key that is sampled under one assignment
is more likely to be sampled under other assignments. Our
design has the following important properties: 

• Scalability: The processing of each assignment is a simple adaptation
of a single-assignment weighted sampling algorithm. Coordination
is achieved by using the same hash function across assignments.

• Weighted sample for each assignment: Our design is especially 
appealing for applications where sample-based summaries are al- 
ready used, such as periodic (hourly) summaries of IP flow records. 
The use of our framework versus independent sampling in differ- 
ent periods facilitates support for queries on the relation of the data 
across time periods. 

• Tight estimators: We provide a principled generic derivation of 
estimators, tailor it to obtain tight unbiased estimators for the min, 
max, and $L_1$, and bound the variance.

Colocated weights model: For colocated data, the full weight 
vector of each key is readily available to the summarization algo- 
rithm and can be easily incorporated in the summary. We discuss 
the shortcomings of applying previous methods to summarize this 
data. One approach is to sample records according to one particular 
weight assignment. Such a sample can be used to estimate aggregates
that involve other assignments^1, but estimates may have large
variance and be biased. Another approach is to concurrently compute
multiple weighted samples, one for each assignment. In this
case, single-assignment aggregates can be computed over the respective
sample but no unbiased estimators for multiple-assignment
aggregates were known. Moreover, such a summary is wasteful in 
terms of storage as different assignments are often correlated (such 
as number of bytes and number of IP packets of an IP flow). 

^1 This is standard: multiply the per-key estimate by an appropriate
ratio [51].



We consider summaries where the set of included keys embeds 
a weighted sample with respect to each assignment. The set of em- 
bedded samples can be independent or coordinated. Such a sum- 
mary can be computed in a scalable way by a stream algorithm or 
distributively. 

• We derive estimators, which we refer to as inclusive estimators, 
that utilize all keys included in the summary. An inclusive esti- 
mator of a single-assignment aggregate applied to a summary that 
embeds a certain weighted sample from that assignment is signif- 
icantly tighter than an estimator directly applied to the embedded 
sample. Moreover, inclusive estimators are applicable to
multiple-assignment aggregates, such as the min, max, and $L_1$.

• We show that when the embedded samples are coordinated, the 
number of distinct keys in the summary is minimized. 

Empirical evaluation. We performed a comprehensive empirical
evaluation using IP packet traces, a movie ratings data set (the Netflix
Challenge [44]), and a stock quotes data set. These data sets
and queries also demonstrate potential applications. For dispersed
data we achieve orders of magnitude reduction in variance over
previously-known estimators and estimators applied to independent
weighted samples. The variance of these estimators is comparable
to the variance of a weighted sum estimator of a single weight
assignment.

For co-located data, we demonstrate that the size of our com- 
bined sample is significantly smaller than the sum of the sizes of 
independent samples, one for each weight assignment. We also
demonstrate that even for single assignment aggregates, our esti- 
mators which use the combined sample are much tighter than the 
estimators that use only a sample for the particular assignment. 

Organization. The remainder of the paper is arranged as follows. 
Section 2 reviews related work, Section 3 presents key background
concepts, and Section 4 presents our sampling approach. Sections 5-7
present our estimators: Section 5 develops a generic derivation of
estimators, which we apply to colocated summaries in Section 6 and
to dispersed summaries in Section 7. Section 8 provides bounds on
the variance. This is an extended version of [19].

2. RELATED WORK 

Sample coordination. Sample coordination has been used in survey
sampling for almost four decades. Negative coordination in repeated
surveys was used to decrease the likelihood that the same
subject is surveyed (and burdened) multiple times. Positive coordination
was used to make samples as similar as possible when
parameters change in order to reduce overhead. Coordination is
obtained using the PRN (Permanent Random Numbers) method
for Poisson samples [5] and order samples [50, 46, 48]. PRN resembles
our "shared-seed" coordination method. The challenges
of massive data sets, however, are different from those of survey 
sampling and in particular, we are not aware of previously existing 
unbiased estimators for multiple-assignment aggregates over coor- 
dinated weighted samples. 

Coordination (of Poisson, $k$-mins, and order samples) was (re)introduced
in computer science as a method to support aggregations
that involve multiple sets [7, 6, 11, 31, 32, 3, 17, 33, 18]. Coordination
addressed the issue that independent samples of different
sets over the same universe provide weak estimators for multiple- 
set aggregates such as intersection size or similarity. Intuitively, 
two large but almost identical sets are likely to have disjoint inde- 
pendent samples - the sampling does not retain any information on 
the relations between the sets. 






This previous work, however, considered restricted weight mod- 
els: uniform, where all weights are 0/1, and global weights, where 
a key has the same weight value across all assignments where its 
weight is strictly positive (but the weight can vary between keys). 
Allowing the same key to assume different positive weights in dif- 
ferent assignments is clearly essential for our applications. 

While these methods can be applied with general weights, by 
ignoring weight values and performing coordinated uniform sam- 
pling, resulting estimators are weak. Intuitively, uniform sampling 
performs poorly on weighted data because it is likely to leave out 
keys with dominant weights. Weighted sampling, where keys with 
larger weights are more likely to be represented in the sample, is 
essential for boundable variance of weighted aggregates. 

With global weights, assignments correspond to sets over the 
same universe. The structure of coordinated samples for general 
weights turns out to be much more involved than with global weights, 
where all samples for different assignments (sets) are derived from 
a single "global" (random) ranking of keys. The derivation of un- 
biased estimators was also more challenging: global weights allow 
us to make simple inferences on inclusion of keys in a set when the 
key is not represented in the sample. These inferences facilitate the 
derivation of estimators but do not hold under general weights. 

Unaggregated data. Sample-based sketches [30, 13, 12] and sketches
that are not samples were also proposed for unaggregated data
streams (the scalar weight of each key appears in multiple data
points) [2]. This is a more general model with weaker estimators
than keys with scalar weights. We leave for future work summarization
of unaggregated data sets with vector weights.

VarOpt is a weighted sampling design [18, 14] that realizes all the
advantages of other schemes but it is not clear if it can be applied
with coordinated samples (even with global weights).

Sketches that are not samples. Sketches that are not sample based
(4T]|g[Tl|5l[20l|37l|42l|2Tl|29lare effective point solutions for 
particular metrics such as max-dominance [21] or $L_1$ [29] difference.
Their disadvantage is less flexibility in terms of supported
aggregates and in particular, no support for aggregates over selected
subpopulations of keys: we can estimate the overall $L_1$ difference
between two time periods but we cannot estimate the difference
restricted to a subpopulation such as flows to a particular destination
or a certain application. There is also no mechanism to obtain
"representative" keys [53]. Lastly, even a practical implementation
of [21, 29] involves constructions of stable distributions or range
summable random variables (whereas for our sample-based summaries
all that is needed is "random-looking" hash functions). When
compared with these methods, for example, to estimate the
max-dominance norm between weight assignments, our methods feature
the same asymptotic dependence between approximation and sample
size ($k = O(\epsilon^{-2})$) in order to support the queries supported by
these methods.

Bloom filters [4, 28] also support estimation of similarity metrics
but summary size is not tunable and grows linearly with the number 
of keys. 

3. PRELIMINARIES 

A weighted set $(I, w)$ consists of a set of keys $I$ and a function
$w$ assigning a scalar weight value $w(i) \ge 0$ to each key $i \in I$. We
review components of sample-based summarizations of a weighted
set: sample distributions, respective sketches, which in our context
are samples with some auxiliary information, and adjusted weights
associated with sampled keys, which are used to answer weight
queries. Sample distributions are defined through random rank
assignments [11, 48, 16, 26, 17, 18] that map each key $i$ to a rank
value $r(i)$. The rank assignment is defined with respect to a family
of probability density functions $f_w$ ($w > 0$), where each $r(i)$ is
drawn independently according to $f_{w(i)}$. We say that $f_w$ ($w > 0$)
are monotone if for all $w_1 \ge w_2$, for all $x$, $F_{w_1}(x) \ge F_{w_2}(x)$
(where $F_w$ are the respective cumulative distributions). For a set $J$
and a rank assignment $r$ we denote by $r_i(J)$ the $i$th smallest rank
of a key in $J$; we also abbreviate and write $r(J) \equiv r_1(J)$.

Figure 1: Example of a weighted set, a random rank assignment
with IPPS ranks, Poisson and order samples, and respective
AW-summaries. [Panels: weighted set $(I, w)$ with keys
$I = \{i_1, \ldots, i_6\}$ and a rank assignment $r$; order samples of size
$k = 1, 2, 3$ and AW-summaries, with $p(i) = \min\{1, w(i) r_{k+1}\}$ and
$a(i) = w(i)/p(i)$.]

• A Poisson-$\tau$ sample of $J$ is defined with respect to a rank assignment
$r$. The sample is the set of keys with $r(i) < \tau$. The sample
has expected size $k = \sum_{i \in J} F_{w(i)}(\tau)$. Keys have independent
inclusion probabilities. The sketch includes the pairs $(r(i), w(i))$ and
may include key identifiers with attribute values.

• An order-$k$ (bottom-$k$) sample of $J$ contains the $k$ keys $i_1, \ldots, i_k$
of smallest ranks in $J$. The sketch $s_k(J, r)$ consists of the $k$ pairs
$(r(i_j), w(i_j))$, $j = 1, \ldots, k$, and $r_{k+1}(J)$ (if $|J| < k$ we store
only $|J|$ pairs), and may include the key identifiers $i_j$ and additional
attributes.

• A $k$-mins sample of $J \subset I$ is produced from $k$ independent rank
assignments $r^{(1)}, \ldots, r^{(k)}$. The sample is the set of (at most $k$)
keys with minimum rank values $r^{(1)}(J), r^{(2)}(J), \ldots, r^{(k)}(J)$.
The sketch includes the minimum rank values and, depending on
the application, may include corresponding key identifiers and
attribute values.

When weights of keys are uniform, a $k$-mins sample is the result
of $k$ uniform draws with replacement, order-$k$ samples are $k$ uniform
draws without replacement, and Poisson-$\tau$ samples are independent
Bernoulli trials. The particular family matters when
weights are not uniform. Two families with special properties are:

• EXP ranks: $f_w(x) = w e^{-wx}$ ($F_w(x) = 1 - e^{-wx}$) are exponentially
distributed with parameter $w$ (denoted by EXP[$w$]). Equivalently,
if $u \in U[0, 1]$ then $-\ln(u)/w$ is an exponential random variable
with parameter $w$. EXP[$w$] ranks have the property that the
minimum rank $r(J)$ has distribution EXP[$w(J)$], where
$w(J) = \sum_{i \in J} w(i)$. (The minimum of independent exponentially distributed
random variables is exponentially distributed with parameter equal
to the sum of the parameters of these distributions.) This property
is useful for designing estimators and efficiently computing
sketches (TTlIIIlllTlQtlO. The $k$-mins sample [11] of a set is a
sample drawn with replacement in $k$ draws where a key is selected
with probability equal to the ratio of its weight and the total weight.
An order-$k$ sample is the result of $k$ such draws performed without
replacement, where keys are selected according to the ratio of their
weight and the weight of remaining keys [47, 34, 48].

• IPPS ranks: $f_w$ is the uniform distribution $U[0, 1/w]$ ($F_w(x) =
\min\{1, wx\}$). This is equivalent to choosing rank value $u/w$,
where $u \in U[0, 1]$. The Poisson-$\tau$ sample is an IPPS sample [34]
(Inclusion Probability Proportional to Size). The order-$k$ sample is
a priority sample [45, 26] (PRI).

Figure 1 shows an example of a weighted set with 6 keys and
a respective rank assignment with IPPS ranks. The figure shows
the corresponding Poisson samples of expected size $k = 1, 2, 3$.
The value $\tau$ is calculated according to the desired expected sample
size. The sample includes all keys with rank value that is below $\tau$.
This particular rank assignment yielded a Poisson sample of size
1 when the expected size was 1, 2, 3. The figure also shows order
samples of sizes $k = 1, 2, 3$, containing the $k$ keys with smallest
rank values.
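
The rank families and sample definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the weights, and the seeds are hypothetical (they are not the values of Figure 1), and returning $+\infty$ as the $(k+1)$st rank when fewer than $k+1$ keys exist is a convention of this sketch.

```python
import math

def ipps_rank(u, w):
    """IPPS rank u/w for seed u in (0, 1]; zero weight maps to +inf."""
    return u / w if w > 0 else math.inf

def exp_rank(u, w):
    """EXP rank -ln(u)/w for seed u in (0, 1]; zero weight maps to +inf."""
    return -math.log(u) / w if w > 0 else math.inf

def poisson_sample(ranks, tau):
    """Poisson-tau sample: every key whose rank is below tau."""
    return {i for i, r in ranks.items() if r < tau}

def order_sample(ranks, k):
    """Order-k sample: the k keys of smallest rank and the (k+1)st smallest rank.

    Returning +inf when fewer than k+1 keys exist is a convention of this sketch.
    """
    ordered = sorted(ranks, key=ranks.get)
    r_next = ranks[ordered[k]] if len(ordered) > k else math.inf
    return ordered[:k], r_next

# Hypothetical weights and seeds (not the values of Figure 1).
weights = {"a": 20.0, "b": 10.0, "c": 5.0, "d": 1.0}
seeds = {"a": 0.4, "b": 0.1, "c": 0.9, "d": 0.05}
ranks = {i: ipps_rank(seeds[i], weights[i]) for i in weights}

print(order_sample(ranks, 2))
print(poisson_sample(ranks, 0.03))
```

With these made-up inputs the two heaviest keys have the smallest ranks, so both the order-2 sample and the Poisson sample with threshold 0.03 contain exactly those two keys.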

Adjusted weights. A technique to obtain estimators for the weights
of keys is by assigning an adjusted weight $a(i) \ge 0$ to each key $i$
in the sample (adjusted weight $a(i) = 0$ is implicitly assigned to
keys not in the sample). The adjusted weights are assigned such
that $E[a(i)] = w(i)$, where the expectation is over the randomized
algorithm choosing the sample. We refer to the (random variable)
that combines a weighted sample of $(I, w)$ together with adjusted
weights as an adjusted-weights summary (AW-summary) of
$(I, w)$. An AW-summary allows us to obtain an unbiased estimate
of the weight of any subpopulation $J \subset I$. The estimate
$\sum_{j \in J} a(j) = \sum_{j \in J \mid a(j) > 0} a(j)$ is easily computed from the
summary provided that we have sufficient auxiliary information to tell
for each key in the summary whether it belongs to $J$ or not. Figure 1
shows example AW-summaries for the Poisson and order
samples. The set $J = \{i_2, i_4, i_6\}$ with weight $w(J) = w(i_2) +
w(i_4) + w(i_6) = 10 + 20 + 10 = 40$ has estimate of 0 using
the three Poisson AW-summaries and estimates 0, 21.74, 38.18
respectively by the three order AW-summaries. Moreover, for
any secondary numeric function $h()$ over keys' attributes such that
$h(i) > 0 \Rightarrow w(i) > 0$ and any subpopulation $J$,

$$\sum_{j \in J \mid a(j) > 0} a(j) h(j)/w(j)$$

is an unbiased estimate of $\sum_{j \in J} h(j)$.

Horvitz-Thompson (HT). Let $\Omega$ be the distribution over samples
such that if $w(i) > 0$ then $p^{(\Omega)}(i) = \Pr\{i \in s \mid s \in \Omega\}$ is positive.
If we know $p^{(\Omega)}(i)$ for every $i \in s$, we can assign to $i \in s$
the adjusted weight $a(i) = w(i)/p^{(\Omega)}(i)$. Since $a(i)$ is 0 when $i \notin s$,
$E[a(i)] = w(i)$ ($a(i)$ is an unbiased estimator of $w(i)$). These
adjusted weights are called the Horvitz-Thompson (HT) estimator
[35]. For a particular $\Omega$, the HT adjusted weights minimize
$\mathrm{VAR}[a(i)]$ for all $i \in I$. The HT adjusted weights for Poisson
$\tau$-sampling are $a(i) = w(i)/F_{w(i)}(\tau)$. Figure 1 shows the inclusion
probability $F_{w(i)}(\tau)$ and a corresponding AW-summary for
the Poisson samples. Poisson sampling with IPPS ranks and HT
adjusted weights are known to minimize the sum $\sum_{i \in I} \mathrm{VAR}(a(i))$
of per-key variances over all AW-summaries with the same expected
size.
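
As a hedged illustration of the HT adjusted weights for Poisson $\tau$-sampling with IPPS ranks, the sketch below computes $a(i) = w(i)/\min\{1, w(i)\tau\}$ for sampled keys and sums adjusted weights over a subpopulation. The function names, weights, seeds, and the threshold $\tau = 1/32$ are assumptions chosen for the example, not values from the paper.

```python
def ipps_inclusion_probability(w, tau):
    """F_w(tau) = min(1, w * tau) for IPPS ranks."""
    return min(1.0, w * tau)

def ht_adjusted_weights(weights, seeds, tau):
    """HT adjusted weights of a Poisson-tau sample with IPPS ranks u/w."""
    aw = {}
    for i, w in weights.items():
        if w > 0 and seeds[i] / w < tau:  # key i is sampled
            aw[i] = w / ipps_inclusion_probability(w, tau)
    return aw  # keys that are not sampled implicitly have adjusted weight 0

def subpopulation_estimate(aw, J):
    """Estimate w(J) as the sum of adjusted weights of sampled keys in J."""
    return sum(a for i, a in aw.items() if i in J)

# Hypothetical weights and seeds.
weights = {"a": 20.0, "b": 10.0, "c": 5.0, "d": 1.0}
seeds = {"a": 0.4, "b": 0.1, "c": 0.9, "d": 0.05}
aw = ht_adjusted_weights(weights, seeds, tau=1 / 32)
estimate = subpopulation_estimate(aw, {"b", "c"})
```

Every sampled key whose inclusion probability is below 1 receives the same adjusted weight $1/\tau$, which is the characteristic behavior of IPPS sampling; a single estimate can be far from the true subpopulation weight, and unbiasedness holds only in expectation over the seeds.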

HT on a partitioned sample space (HTp) [17]. This is a method
to derive adjusted weights when we cannot determine
$\Pr\{i \in s \mid s \in \Omega\}$ from the information contained in the sketch $s$
alone. For example, if $s$ is an order-$k$ sample of $(I, w)$, then
$\Pr\{i \in s \mid s \in \Omega\}$ generally depends on all the weights $w(i)$ for
$i \in I$ and therefore cannot be determined from $s$.

For each key $i$ we consider a partition of $\Omega$ into equivalence
classes. For a sketch $s$, let $P^i(s) \subset \Omega$ be the equivalence class of $s$.
This partition must satisfy the following requirement: Given $s$ such
that $i \in s$, we can compute the conditional probability $p^i(s) =
\Pr\{i \in s' \mid s' \in P^i(s)\}$ from the information included in $s$.

We can therefore compute for all $i \in s$ the assignment $a(i) =
w(i)/p^i(s)$ (implicitly, $a(i) = 0$ for $i \notin s$). It is easy to see that
within each equivalence class, $E[a(i)] = w(i)$. Therefore, also
over $\Omega$ we have $E[a(i)] = w(i)$.
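
The unbiasedness claim can be spelled out in two steps. The derivation below is a sketch written for a countable partition into classes $P$ with positive conditional inclusion probability; with a continuum of classes the sum becomes an integral.

```latex
\begin{align*}
E\bigl[a(i) \mid s' \in P^i(s)\bigr]
  &= p^i(s) \cdot \frac{w(i)}{p^i(s)} + \bigl(1 - p^i(s)\bigr) \cdot 0 = w(i), \\
E[a(i)]
  &= \sum_{P} \Pr\{s' \in P\} \, E\bigl[a(i) \mid s' \in P\bigr] = w(i).
\end{align*}
```

The first line uses that the conditional inclusion probability is the same for every sketch in the class; the second is the law of total expectation over the classes.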

Rank Conditioning (RC) is an HTp method designed for an order-$k$
sketch [17]. For each $i$ and possible rank value $r$ we have an
equivalence class $P^i_r$ containing all sketches in which the $k$th smallest
rank value assigned to a key other than $i$ is $r$. Note that if $i \in s$
then this is the $(k+1)$st smallest rank, which is included in the
sketch. It is easy to see that the inclusion probability of $i$ in a sketch
in $P^i_r$ is $p^i_r = F_{w(i)}(r)$.

Assume $s$ contains $i_1, \ldots, i_k$ and the $(k+1)$st smallest rank
value $r_{k+1}$. Then for key $i_j$ we have $s \in P^{i_j}_{r_{k+1}}$ and
$a(i_j) = w(i_j)/F_{w(i_j)}(r_{k+1})$. This RC method was extended in [18] to obtain
tighter estimates for a set of coordinated order-$k$ sketches
with global weights [18]. Figure 1 shows the $(k+1)$st smallest
rank value, the conditional inclusion probability $F_{w(i)}(r_{k+1})$
and the corresponding AW-summary for each order sample in the
example.
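
A minimal sketch of RC adjusted weights for an order-$k$ sample with IPPS ranks, where $F_{w}(r) = \min\{1, wr\}$. The function name, the inputs, and the handling of the case with at most $k$ positive-weight keys (exact weights are returned) are assumptions of this illustration.

```python
def rc_adjusted_weights(weights, ranks, k):
    """Rank-conditioning adjusted weights for an order-k sample with IPPS ranks."""
    ordered = sorted((i for i in weights if weights[i] > 0), key=ranks.get)
    sample = ordered[:k]
    if len(ordered) <= k:
        # Every positive-weight key is in the sketch, so weights are known exactly.
        return {i: weights[i] for i in sample}
    r_next = ranks[ordered[k]]  # (k+1)st smallest rank
    # Conditional inclusion probability F_{w(i)}(r_next) = min(1, w(i) * r_next).
    return {i: weights[i] / min(1.0, weights[i] * r_next) for i in sample}

# Hypothetical weights and seeds; ranks are IPPS ranks u/w.
weights = {"a": 20.0, "b": 10.0, "c": 5.0, "d": 1.0}
seeds = {"a": 0.4, "b": 0.1, "c": 0.9, "d": 0.0625}
ranks = {i: seeds[i] / weights[i] for i in weights}
aw = rc_adjusted_weights(weights, ranks, k=2)
```

In this example the heaviest key has conditional inclusion probability 1 and keeps its exact weight, while the second sampled key is scaled up by the inverse of its conditional inclusion probability.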

We subsequently use the notation $\Omega(i, r)$ for the probability subspace
of rank assignments that contains all rank assignments $r'$ that
agree with $r$ on all keys in $I \setminus \{i\}$.

The RC estimator for order-$k$ samples with IPPS ranks [26] has
a sum of per-key variances that is at most that of an HT estimator
applied to a Poisson sample with IPPS ranks and expected size $k + 1$
[54]. Order sampling emerges as superior to Poisson sampling,
since it matches its estimation quality per expected sample size and
has the desirable property of a fixed sample size.

Sum of per-key variances. Different AW-summaries are compared
based on their estimation quality. Variance is the standard metric
for the quality of an estimator for a single quantity. For a subpopulation
$J$ and AW-summary $a()$, the variance is $\mathrm{VAR}[a(J)] =
E[a(J)^2] - w(J)^2$. Since our application is for arbitrary subpopulations
that may not be specified a priori, the notion of a good metric
is more subtle. Clearly there is no single AW-summary that dominates
all others of the same size (minimizes the variance) for all $J$.

RC adjusted weights have zero covariances, that is, for any two
keys $i \ne j$, $\mathrm{COV}[a(i), a(j)] = E[a(i)a(j)] - w(i)w(j) = 0$ [17].
This property extends to applications of the RC method to coordinated
sketches with global weights [18]. HT adjusted weights for
Poisson sketches have zero covariances (this is immediate from
independence). When covariances are zero, the variance of $a(J)$ for
a particular subpopulation $J$ is equal to $\sum_{i,j \in J} \mathrm{COV}[a(i), a(j)] =
\sum_{i \in J} \mathrm{VAR}[a(i)]$. For AW-summaries with zero covariances, the
sum of per-key variances $\Sigma V[a] \equiv \sum_{i \in I} \mathrm{VAR}[a(i)]$ also measures
average variance over subpopulations of certain weight [55].
$\Sigma V[a]$ hence serves as a balanced performance metric [26, 17] and
we use it in our performance evaluation.

Estimators for Poisson, $k$-mins, and order sketches with EXP or
IPPS ranks have $\Sigma V[a] \le w(I)^2/(k - 2)$ (where $k$ is the (expected)
sample size) [11, 16, 26, 54, 17]. This bound is tight when keys
have uniform weights and $k \ll |I|$, but $\Sigma V[a]$ is lower for order
and Poisson sketches when the weight distribution is skewed [16, 26].
For a subpopulation $J$ with expected $k'$ samples in the sketch,
the variance on estimating $w(J)$ is bounded by $w(J)^2/(k' - 2)$ 111 II
MM-

4. MODEL AND SUMMARY FORMATS 

We model the data using a set of keys $I$ and a set $\mathcal{W}$ of weight
assignments over $I$. For each $b \in \mathcal{W}$, $w^{(b)} : I \to \mathbb{R}_{\ge 0}$ maps
keys to nonnegative reals. Figure 2 shows an example data set with
$I = \{i_1, \ldots, i_6\}$ and $\mathcal{W} = \{1, 2, 3\}$. For $i \in I$ and
$\mathcal{R} \subset \mathcal{W}$, we use the notation $w^{(\mathcal{R})}(i)$ for the weight vector
with entries $w^{(b)}(i)$ ordered by $b \in \mathcal{R}$.

We are interested in aggregates of the form $\sum_{i \mid d(i) = 1} f(i)$ where
$d$ is a selection predicate and $f$ is a numeric function, both defined
over the set of keys $I$. $f(i)$ and $d(i)$ may depend on the attribute
values associated with key $i$ and on the weight vector $w^{(\mathcal{W})}(i)$.

We say that the function $f$/predicate $d$ is single-assignment if it
depends on $w^{(b)}$ for a single $b \in \mathcal{W}$. Otherwise we say that it
is multiple-assignment. The relevant assignments of $f$ and $d$ are
those necessary for determining all keys $i$ such that $d(i) = 1$ and
evaluating $f(i)$ for these keys.

The maximum and minimum with respect to a set of assignments
$\mathcal{R} \subset \mathcal{W}$ are defined by $f(i)$ as follows:

$$w^{(\max \mathcal{R})}(i) \equiv \max_{b \in \mathcal{R}} w^{(b)}(i), \qquad w^{(\min \mathcal{R})}(i) \equiv \min_{b \in \mathcal{R}} w^{(b)}(i). \qquad (1)$$

The relevant assignments for $f$ in this case are $\mathcal{R}$. Sums over these
$f$'s are also known as the max-dominance and min-dominance norms
[21, 22] of the selected subset. The maximum reduces to the size
of set union and the minimum to the size of set intersection for the
special case of global weights.

The ratio $\sum_{i \in J} w^{(\min \mathcal{R})}(i) / \sum_{i \in J} w^{(\max \mathcal{R})}(i)$ when
$|\mathcal{R}| = 2$ is the weighted Jaccard similarity of the assignments
$\mathcal{R}$ on $J$. The $L_1$ difference can be expressed as a sum aggregate
by choosing $f(i)$ to be

$$w^{(L_1 \mathcal{R})}(i) \equiv w^{(\max \mathcal{R})}(i) - w^{(\min \mathcal{R})}(i). \qquad (2)$$

For the example in Figure 2, the max-dominance norm over even
keys (specified by a predicate $d$ that is true for $i_2, i_4, i_6$) and
assignments $\mathcal{R} = \{1, 2, 3\}$ is $w^{(\max\{1,2,3\})}(i_2) + w^{(\max\{1,2,3\})}(i_4) +
w^{(\max\{1,2,3\})}(i_6) = 15 + 20 + 10 = 45$, and the $L_1$ distance between
assignments $\mathcal{R} = \{2, 3\}$ over keys $i_1, i_2, i_3$ is
$w^{(L_1\{2,3\})}(i_1) + w^{(L_1\{2,3\})}(i_2) + w^{(L_1\{2,3\})}(i_3) = 10 + 5 + 3 = 18$.
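
The key-wise functions in (1) and (2) and the sum aggregate over selected keys can be computed exactly on a small data set as follows. This is an illustrative sketch: the helper names and the three-key data set are hypothetical and do not reproduce Figure 2.

```python
def w_max(wvec, R):
    """Maximum weight of a key over the assignments in R."""
    return max(wvec[b] for b in R)

def w_min(wvec, R):
    """Minimum weight of a key over the assignments in R."""
    return min(wvec[b] for b in R)

def w_l1(wvec, R):
    """Key-wise L1 value: maximum minus minimum over R."""
    return w_max(wvec, R) - w_min(wvec, R)

def aggregate(data, f, d=lambda key: True):
    """Sum of f over the weight vectors of keys selected by predicate d."""
    return sum(f(wvec) for key, wvec in data.items() if d(key))

# Hypothetical data: key -> {assignment: weight}.
data = {
    "k1": {1: 4, 2: 0, 3: 7},
    "k2": {1: 9, 2: 6, 3: 2},
    "k3": {1: 0, 2: 5, 3: 5},
}
max_dom = aggregate(data, lambda v: w_max(v, {1, 2, 3}))
l1_23 = aggregate(data, lambda v: w_l1(v, {2, 3}), d=lambda key: key != "k3")
jaccard_23 = (aggregate(data, lambda v: w_min(v, {2, 3}))
              / aggregate(data, lambda v: w_max(v, {2, 3})))
```

Here `max_dom` is the max-dominance norm over all keys, `l1_23` is the $L_1$ distance between assignments 2 and 3 restricted to a subpopulation, and `jaccard_23` is the weighted Jaccard similarity of assignments 2 and 3.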

This classification of dispersed and colocated models differentiates
the summary formats that can be computed in a scalable way:
With colocated weights, each key is processed once, and samples
for different assignments $b \in \mathcal{W}$ are generated together and can be
coupled. Moreover, the (full) weight vector can be easily incorporated
with each key included in the final summary. With dispersed
weights, any scalable summarization algorithm must decouple the
sampling for different $b \in \mathcal{W}$. The process and result for $b \in \mathcal{W}$
can only depend on the values $w^{(b)}(i)$ for $i \in I$. The final summary
is generated from the results of these disjoint processes.

Random rank assignments for $(I, \mathcal{W})$. A random rank assignment
for $(I, \mathcal{W})$ associates a rank value $r^{(b)}(i)$ for each $i \in I$ and
$b \in \mathcal{W}$. If $w^{(b)}(i) = 0$, $r^{(b)}(i) = +\infty$. The rank vector of $i \in I$,
$r^{(\mathcal{W})}(i)$, has entries $r^{(b)}(i)$ ordered by $b \in \mathcal{W}$. The distribution $\Omega$
is defined with respect to a monotone family of density functions
$f_w$ ($w > 0$) and has the following properties: (i) For all $b$ and $i$
such that $w^{(b)}(i) > 0$, the distribution of $r^{(b)}(i)$ is $f_{w^{(b)}(i)}$. (ii)
The rank vectors $r^{(\mathcal{W})}(i)$ for $i \in I$ are independent. (iii) For all
$i \in I$, the distribution of the rank vector $r^{(\mathcal{W})}(i)$ depends only on
the weight vector $w^{(\mathcal{W})}(i)$.

It follows from (i) and (ii) that for each $b \in \mathcal{W}$, $\{r^{(b)}(i) \mid i \in I\}$
is a random rank assignment for the weighted set $(I, w^{(b)})$ with
respect to the family $f_w$ ($w > 0$). The mapping (iii) from weight
vectors to distributions of rank vectors specifies $\Omega$.

Independent or consistent ranks. If for each key $i$, the entries
$r^{(b)}(i)$ ($b \in \mathcal{W}$) of the rank vector of $i$ are independent we say
that the rank assignment has independent ranks. In this case $\Omega$ is
the product distribution of independent rank assignments $r^{(b)}$ for
$(I, w^{(b)})$ ($b \in \mathcal{W}$).

A rank assignment has consistent ranks if for each key $i \in I$ and
any two weight assignments $b_1, b_2 \in \mathcal{W}$,

$$w^{(b_1)}(i) \ge w^{(b_2)}(i) \Rightarrow r^{(b_1)}(i) \le r^{(b_2)}(i)$$

(in particular, if entries of the weight vector are equal then corresponding
rank values are equal, that is, $w^{(b_1)}(i) = w^{(b_2)}(i) \Rightarrow
r^{(b_1)}(i) = r^{(b_2)}(i)$).

In the special case of global (or uniform) weights, consistency
means that the entries of each rank vector are equal and distributed
according to $f_{w(i)}$ for all $b \in \mathcal{W}$ such that $w^{(b)}(i) > 0$. Therefore,
the distribution of the rank vectors is determined uniquely by the
family $f_w$ ($w > 0$). This is not true for general weights. We
explore the following two distributions of consistent ranks, specified
by a mapping of weight vectors to probability distributions of
rank vectors.

• Shared-seed: Independently, for each key $i \in I$:

  • $u(i) \leftarrow U[0,1]$ (where $U[0,1]$ is the uniform distribution on $[0,1]$).

  • For $b \in \mathcal{W}$, $r^{(b)}(i) \leftarrow F^{-1}_{w^{(b)}(i)}(u(i))$.

That is, for $i \in I$, $r^{(b)}(i)$ ($b \in \mathcal{W}$) are determined using the
same "placement" ($u(i)$) in $F_{w^{(b)}(i)}$.

Consistency of this construction is an immediate consequence of
the monotonicity property of $f_w$.

Shared-seed assignment for IPPS ranks is $r^{(b)}(i) = u(i)/w^{(b)}(i)$
and for EXP ranks, is $r^{(b)}(i) = -\ln u(i)/w^{(b)}(i)$.
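
The shared-seed construction can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper: the function name `shared_seed_ranks` and its `family` parameter are assumptions made here. For EXP ranks the sketch evaluates the inverse CDF at $u$, that is $-\ln(1-u)/w$; since $1-u$ is also uniform, this has the same distribution as the $-\ln u / w$ form.

```python
import math

def shared_seed_ranks(weights, u, family="IPPS"):
    """Ranks r^(b)(i) of one key, from its weight vector and a single seed u in (0, 1).

    IPPS: r = u / w.  EXP: r = -ln(1 - u) / w (inverse CDF of EXP[w] at u).
    A zero weight always maps to rank +infinity.
    """
    ranks = []
    for w in weights:
        if w == 0:
            ranks.append(math.inf)
        elif family == "IPPS":
            ranks.append(u / w)
        elif family == "EXP":
            ranks.append(-math.log(1.0 - u) / w)
        else:
            raise ValueError(f"unknown rank family: {family}")
    return ranks

# Key i1 of the example data set in Figure 2: weights (15, 20, 10), seed u = 0.22.
ranks_i1 = shared_seed_ranks([15, 20, 10], 0.22)
print([round(r, 4) for r in ranks_i1])
```

Because one seed is reused for every assignment, a larger weight can only produce a smaller (or equal) rank, which is the consistency property.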

• Independent-differences is specific to EXP ranks. Recall that
EXP[$w$] denotes the exponential distribution with parameter $w$. Independently,
for each key $i$:

Let $w^{(b_1)}(i) \leq \cdots \leq w^{(b_h)}(i)$ be the entries of the weight
vector of $i$.

  • For $j \in 1 \ldots h$, $d_j \leftarrow \mathrm{EXP}[w^{(b_j)}(i) - w^{(b_{j-1})}(i)]$, where
    $w^{(b_0)}(i) \equiv 0$ and the $d_j$ are independent.

  • For $j \in 1 \ldots h$, $r^{(b_j)}(i) \leftarrow \min_{a \leq j} d_a$.

For these ranks consistency is immediate from the construction.
Since the distribution of the minimum of independent exponential
random variables is exponential with parameter that is equal to the
sum of the parameters, we have that for all $b \in \mathcal{W}$, $i \in I$, $r^{(b)}(i)$ is
exponentially distributed with parameter $w^{(b)}(i)$.
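
A minimal Python sketch of the independent-differences construction for a single key follows. The function name and the dict-based interface are assumptions for illustration; a difference of zero between consecutive sorted weights is treated as $d_j = +\infty$, so equal weights receive equal ranks and a zero weight receives rank $+\infty$.

```python
import math
import random

def independent_differences_ranks(weights, rng=random):
    """EXP ranks for one key with independent-differences consistency.

    weights: dict mapping assignment b -> w^(b)(i) >= 0.
    Returns a dict mapping b -> r^(b)(i).
    """
    order = sorted(weights, key=weights.get)   # b_1, ..., b_h by increasing weight
    ranks, prev_w, running_min = {}, 0.0, math.inf
    for b in order:
        diff = weights[b] - prev_w
        # d_j ~ EXP[w^(b_j) - w^(b_{j-1})]; a zero rate is treated as d_j = +infinity.
        d = rng.expovariate(diff) if diff > 0 else math.inf
        running_min = min(running_min, d)      # r^(b_j) = min over a <= j of d_a
        ranks[b] = running_min
        prev_w = weights[b]
    return ranks

print(independent_differences_ranks({1: 15, 2: 20, 3: 10}, random.Random(7)))
```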

Coordinated and independent sketches. Coordinated sketches
are derived from assignments with consistent ranks and independent
sketches from assignments with independent ranks. $k$-mins
sketches: An ordered set of $k$ rank assignments for $(I, \mathcal{W})$ defines a
set of $|\mathcal{W}|$ $k$-mins sketches, one for each assignment $b \in \mathcal{W}$. Order
and Poisson sketches: A single rank assignment $r$ on $(I, \mathcal{W})$ defines
an order-$k$ sketch (and a Poisson $\tau^{(b)}$-sketch) for each $b \in \mathcal{W}$
(using the rank values $\{r^{(b)}(i) \mid i \in I\}$). Figure 2 shows examples
of independent and shared-seed consistent rank assignments for the
example data set and the corresponding order-3 samples.
In the sequel we mainly focus on order-$k$ sketches. Derivations
are similar (but simpler) for Poisson sketches. We shall denote by
$S(r)$ the summary consisting of $|\mathcal{W}|$ order-$k$ sketches obtained using
a rank assignment $r$.

(A) keys: I = {i1, ..., i6}; weight assignments: w(1), w(2), w(3)

    assignment/key     i1    i2    i3    i4    i5    i6
    w(1)               15     0    10     5    10    10
    w(2)               20    10    12    20     0    10
    w(3)               10    15    15     0    15    10

    Example functions f(i):
    w(max{1,2})        20    10    12    20    10    10
    w(max{1,2,3})      20    15    15    20    15    10
    w(min{1,2})        15     0    10     5     0    10
    w(min{1,2,3})      10     0    10     0     0    10
    w(L1{1,2})          5    10     2    15    10     0
    w(L1{2,3})         10     5     3    20    15     0

(B) Consistent shared-seed IPPS ranks:

    key      i1       i2       i3       i4       i5       i6
    u        0.22     0.75     0.07     0.92     0.55     0.37
    r(1)     0.0147   +∞       0.007    0.184    0.055    0.037
    r(2)     0.011    0.075    0.0583   0.046    +∞       0.037
    r(3)     0.022    0.05     0.0047   +∞       0.0367   0.037

    Independent IPPS ranks:

    key      i1       i2       i3       i4       i5       i6
    u(1)     0.22     0.75     0.07     0.92     0.55     0.37
    r(1)     0.0147   +∞       0.007    0.184    0.055    0.037
    u(2)     0.47     0.58     0.71     0.84     0.25     0.32
    r(2)     0.0235   0.058    0.0592   0.042    +∞       0.032
    u(3)     0.63     0.92     0.08     0.59     0.32     0.80
    r(3)     0.063    0.0613   0.0053   +∞       0.0213   0.08

    Order-3 samples:

    assignment   shared-seed      independent
    w(1)         i3, i1, i6       i3, i1, i6
    w(2)         i1, i6, i4       i1, i6, i4
    w(3)         i3, i1, i5       i3, i5, i2

Figure 2: (A): Example data set with keys I = {i1, ..., i6} and weight assignments
w(1), w(2), w(3) and per-key values for example aggregates. (B): random rank
assignments and corresponding 3-order samples.

$k$-mins sketches derived from rank assignments with independent-differences
consistent ranks have the following property:

Theorem 4.1. For any $b_1, b_2 \in \mathcal{W}$, the probability that both
assignments have the same minimum-rank key is equal to the weighted
Jaccard similarity of the two weight assignments.

Therefore, the fraction of common keys in the two $k$-mins sketches
is an unbiased estimator of the weighted Jaccard similarity. This
generalizes the estimator for unweighted Jaccard similarity [6].

The following theorem shows that shared-seed consistent ranks
maximize the sharing of keys between sketches. We prove it for
Poisson sketches and conjecture that it holds also for order and
$k$-mins sketches.

Theorem 4.2. Consider all distributions of rank assignments
on $(I, \mathcal{W})$ obtained using a family $F_w$. Shared-seed consistent
ranks minimize the expected number of distinct keys in the union of
the sketches for $(I, w^{(b)})$, $b \in \mathcal{W}$.

Proof. Consider Poisson-$\tau^{(b)}$ sketches ($b \in \mathcal{R}$). Since the
inclusions of different keys are independent, it suffices to show the
claim for a single key $i$. Let $p^{(b)} = F_{w^{(b)}(i)}(\tau^{(b)})$. With any
distribution of rank assignments, the probability that $i$ is included
in at least one sketch for $b \in \mathcal{R}$ is at least $\max_{b \in \mathcal{R}} p^{(b)}$. With
shared-seed ranks, this probability equals $\max_{b \in \mathcal{R}} p^{(b)}$, and hence,
it is minimized. □

Sketches for the maximum weight. For $\mathcal{R} \subseteq \mathcal{W}$, let
$r^{(\min\mathcal{R})}(i) = \min_{b \in \mathcal{R}} r^{(b)}(i)$. The following holds for all
consistent rank assignments:

Lemma 4.1. Let $r$ be a consistent rank assignment for $(I, \mathcal{W})$
with respect to $f_w$ ($w \geq 0$). Let $\mathcal{R} \subseteq \mathcal{W}$. Then $r^{(\min\mathcal{R})}(i)$ is a
rank assignment for the weighted set $(I, w^{(\max\mathcal{R})})$ with respect to
$f_w$ ($w \geq 0$).

Proof. From the definition of consistency, $r^{(\min\mathcal{R})}(i) = r^{(b)}(i)$
where $b = \arg\max_{b \in \mathcal{R}} w^{(b)}(i)$. Therefore, the distribution of
$r^{(\min\mathcal{R})}(i)$ is $f_{w^{(\max\mathcal{R})}(i)}$. It remains to show that $\{r^{(\min\mathcal{R})}(i) \mid i \in I\}$
are independent. This immediately follows from the definition
of a rank assignment: if sets of random variables are independent
(rank vectors of different keys), so are the respective minima. □

A consequence of Lemma 4.1 is the following:

Lemma 4.2. From coordinated Poisson $\tau^{(b)}$-/order $k$-/$k$-mins
sketches for $\mathcal{R} \subseteq \mathcal{W}$, we can obtain a Poisson $\min_{b \in \mathcal{R}} \tau^{(b)}$-/order
$k$-/$k$-mins sketch for $(I, w^{(\max\mathcal{R})})$.

Proof. $k$-mins sketches: we take the coordinate-wise minima
(and respective keys) of the $k$-mins sketch vectors of $(I, w^{(b)})$, $b \in \mathcal{R}$.

Given a rank assignment $r$ for $(I, \mathcal{W})$ then by Lemma 4.1 $r^{(\min\mathcal{R})}(i)$
is a rank assignment for $(I, w^{(\max\mathcal{R})})$. So by the definition of a
$k$-mins sketch we should take the key achieving $\min_{i \in I} r^{(\min\mathcal{R})}(i)$
to the $k$-mins sketch of $(I, w^{(\max\mathcal{R})})$, and repeat this for $k$ different
rank assignments.

Let $j_b$ be the key such that $r^{(b)}(j_b)$ is minimum among all $r^{(b)}(i)$.
The lemma follows since
\[ \min_{b \in \mathcal{R}} r^{(b)}(j_b) = \min_{b \in \mathcal{R}} \min_{i \in I} r^{(b)}(i) = \min_{i \in I} \min_{b \in \mathcal{R}} r^{(b)}(i) = \min_{i \in I} r^{(\min\mathcal{R})}(i) . \]

Poisson $\tau^{(b)}$-sketches: we include all keys with rank value at
most $\min_{b \in \mathcal{R}} \tau^{(b)}$ in the union of the sketches.

Order-$k$ sketches: we take the $k$ distinct keys with smallest rank
values in the union of the sketches. The proof is deferred and is a
consequence of Lemma 7.1. □

This property of coordinated sketches generalizes the union-sketch 
property of coordinated sketches for global and uniform weights, 
which facilitates multiple-set aggregates Illl l7ll6l. 
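
The order-$k$ case of this union property can be illustrated with a short Python sketch. It uses the seeds and the weight assignments $w^{(1)}$, $w^{(3)}$ of the example data set in Figure 2 with shared-seed IPPS ranks; the helper names (`order_k_sketch`, `max_weight_sketch`, `ipps_ranks`) are hypothetical and chosen here for illustration only.

```python
import math

def order_k_sketch(ranks, k):
    """Order-k sketch: the k keys with smallest finite rank, as a {key: rank} dict."""
    finite = sorted((r, i) for i, r in ranks.items() if r < math.inf)
    return {i: r for r, i in finite[:k]}

def max_weight_sketch(sketches, k):
    """Order-k sketch for the maximum weight, from coordinated order-k sketches.

    Each key in the union gets its smallest rank over the sketches containing it;
    the k keys with the smallest such ranks are kept.
    """
    best = {}
    for sketch in sketches:
        for i, r in sketch.items():
            best[i] = min(r, best.get(i, math.inf))
    return order_k_sketch(best, k)

# Seeds u(i) and weight assignments w(1), w(3) of the example data set.
u = {"i1": 0.22, "i2": 0.75, "i3": 0.07, "i4": 0.92, "i5": 0.55, "i6": 0.37}
w1 = {"i1": 15, "i2": 0, "i3": 10, "i4": 5, "i5": 10, "i6": 10}
w3 = {"i1": 10, "i2": 15, "i3": 15, "i4": 0, "i5": 15, "i6": 10}

def ipps_ranks(w):
    # Shared-seed IPPS ranks u(i) / w(i); a zero weight gives rank +infinity.
    return {i: (u[i] / w[i] if w[i] > 0 else math.inf) for i in u}

k = 3
s1 = order_k_sketch(ipps_ranks(w1), k)
s3 = order_k_sketch(ipps_ranks(w3), k)
w_max = {i: max(w1[i], w3[i]) for i in u}
direct = order_k_sketch(ipps_ranks(w_max), k)   # sketch computed from w(max) itself
merged = max_weight_sketch([s1, s3], k)         # sketch computed from the two sketches
print(sorted(merged), merged == direct)
```

The merged sketch is built only from the two per-assignment sketches, yet it coincides with the sketch obtained by ranking the maximum weights directly.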

Fixed number of distinct keys for colocated data. The number
of distinct keys in coordinated size-$k$ sketches is at most $|\mathcal{W}|k$.
It is smaller when weight assignments are more correlated. The
size varies by the rank assignment when $k$ is fixed. A different
natural goal is, instead of fixing $k$, to fix the number of distinct keys
to be in $[|\mathcal{W}|(k-1)+1, |\mathcal{W}|k]$. For a rank
assignment $r$, we define $\ell$ to be the largest such that there are at
most $|\mathcal{W}|k$ distinct keys in the union of the order-$\ell$ sketches with
respect to $r^{(b)}$ ($b \in \mathcal{W}$). As a result, we have varying $\ell \geq k$
but sample size in $[|\mathcal{W}|(k-1)+1, |\mathcal{W}|k]$. This sample can be
computed by a simple adaptation of the stream sampling algorithm
for the fixed-$k$ variant.

Computing coordinated sketches. Coordinated order sketches
can be computed by a small modification of existing order sampling
algorithms. If weights are colocated the computation is simple (for
both shared-seed and independent-differences), as each key is processed
once. For dispersed weights and shared-seed, random hash
functions must be used to ensure that the same seed $u(i)$ is used for
the key $i$ in different assignments. We apply the common practice
of assuming perfect randomness of the rank assignment in the analysis.
This practice is justified by a general phenomenon [49, 43],
that simple heuristic hash functions and pseudo-random number
generators [3] perform in practice as predicted by this simplified
analysis. This phenomenon is also supported by our evaluation.
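
As an illustration of hash-derived seeds for dispersed weights, the following Python sketch maps a key to a reproducible value in $[0, 1)$, so that every site processing the same key obtains the same $u(i)$. The use of SHA-256, the salt value, and the function names are assumptions made for this example; the paper does not prescribe a particular hash function.

```python
import hashlib

def seed(key, salt=b"sketch-seed"):
    """Map a key to a pseudo-random seed u(i) in [0, 1) that is reproducible everywhere.

    The salt and the choice of SHA-256 are illustrative assumptions.
    """
    digest = hashlib.sha256(salt + str(key).encode("utf-8")).digest()
    # Keep 53 bits so that the quotient is an exact float strictly below 1.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53

def ipps_rank(key, weight):
    """Shared-seed IPPS rank u(i) / w; a zero weight has rank +infinity."""
    return seed(key) / weight if weight > 0 else float("inf")

# Two sites that observe the same key with different weights agree on its seed.
print(ipps_rank("10.0.0.1", 20) <= ipps_rank("10.0.0.1", 10))
```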

Independent-differences are not suited for dispersed weights as 
they require use of range summable universal hash functions 1291 

ED. 

5. GENERIC ESTIMATOR 

Consider $(I, \mathcal{W})$, a rank assignment $r \in \Omega$, and a corresponding
summary $S(r)$. The input to the estimator is a numeric function
$f$ and a predicate $d$, defined for each element in $I$. We present a
generic estimator for $\sum_{i \in I \mid d(i)=1} f(i)$.

Our estimator assigns adjusted $f$-weights to a subset $S^*(r)$ of
the keys included in $S(r)$. An estimate for $\sum_{i \in I \mid d(i)=1} f(i)$ is obtained
by summing the adjusted $f$-weights of keys in $S^*(r)$ that
satisfy the predicate $d$. A handy property is that the same adjusted
$f$-weights can be used for different selection predicates $d()$.

We subsequently tailor our generic derivation to different summary
types (colocated or dispersed weights), independent or coordinated
distributions of rank assignments, and $f$ and $d$ with different
dependence on the weight vector.

We present the derivations for a summary that consists of order-$k$
sketches for $b \in \mathcal{W}$ but it can be adapted to summaries with
sketches of different sizes for each $b \in \mathcal{W}$ or for colocated samples
with fixed number of distinct keys.

Generic estimator derivation:

(1): Identify a selection function $S^*(r) \subseteq S(r)$ with the following properties:

  ○ For all $i \in I$ such that $d(i)$ and $f(i) > 0$, $\Pr[i \in S^*(r') \mid r' \in \Omega(i,r)] > 0$.

  ○ From the information available in $S(r)$, we can compute:
      ○ The set $S^*(r)$
      ○ For all $i \in S^*(r)$, the predicate $d(i)$ and the function $f(i)$
      ○ For all $i \in S^*(r)$, $p(i,r) = \Pr[i \in S^*(r') \mid r' \in \Omega(i,r)]$

(2): For $i \in S^*(r)$ such that $d(i)$ holds, $a^{(f)}(i) \leftarrow f(i)/p(i,r)$

(3): Output $\sum_{i \in S^*(r) \mid d(i)=1} a^{(f)}(i)$.

We recall that the probability subspace $\Omega(i,r)$ consists of all
rank assignments $r'$ such that $\forall b \in \mathcal{W}$ and $\forall j \in I \setminus \{i\}$, $r'^{(b)}(j) = r^{(b)}(j)$.
We denote by $p(i,r)$ the probability that $i$ is included in
$S^*(r')$ for $r' \in \Omega(i,r)$.

Step (1) of the derivation is to identify a mapping from summaries
$S(r)$ to subsets $S^*(r) \subseteq S(r)$ which can be used by the
estimator. We require that $S^*(r)$ itself, $f(i)$, $d(i)$, and $p(i,r)$ for
all $i \in S^*(r)$ can be computed from the summary $S(r)$. To get
an unbiased estimator (i.e. $E[a^{(f)}(i)] = f(i)$) we also require that
for any key $i$ such that $d(i) > 0$ and $f(i) > 0$, we have that
$p(i,r) > 0$.

In our tailored derivations, the inclusion event of $i$ in $S^*(r')$ is
typically a union or intersection of events of the form $r'^{(b)}(i) < r^{(b)}_{k+1}(I)$
for some $b \in \mathcal{W}$. We refer to $S^*(r)$ as the set of applicable
samples. There are typically multiple ways to select a mapping $S^*$
that obeys the requirements.

Note: The selection predicate $d(i)$ may seem redundant, as
$\sum_{i \in I \mid d(i)=1} f(i) = \sum_{i \in I} d(i) f(i)$, and we can replace $f$ and
$d$ with the weight function $d(i)f(i)$ without a predicate. Our
specification is geared towards multiple queries that share the same
$f$ but have different attribute-based selections. For example, the
$L_1$ distance of bandwidth (bytes) for IP destination between two
time periods, for different subpopulations of flows (applications,
destination AS, etc.).

Step (2) computes positive adjusted $f$-weights $a^{(f)}(i)$ for all keys
in $S^*(r)$ with $f(i) > 0$ for which $d(i)$ holds. (Other keys have
implicit zero adjusted $f$-weights.) The requirements of Step (1)
guarantee that these adjusted weights can be computed from the
summary. This estimator is an instance of HTP that builds on the
Rank Conditioning (RC) method [17] (see Section 3). Correctness,
which is equivalent to saying that for every $i \in I$, $E[a^{(f)}(i)] = f(i)$,
is immediate from HTP.

Step (3) outputs our estimate for $\sum_{i \in I \mid d(i)=1} f(i)$. The requirements
of Step (1) ensure that we can evaluate the predicate $d(i)$ for
all keys $i \in S^*(r)$, using information in $S(r)$, and hence, we can
evaluate the estimate.
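
Steps (2) and (3) amount to a Horvitz-Thompson style sum over the applicable samples. The following Python sketch shows this with hypothetical inputs: the function name `generic_estimate` and the toy values for $f$ and $p$ are assumptions for illustration, and computing $S^*(r)$ and $p(i,r)$ is left to the tailored derivations.

```python
def generic_estimate(applicable, f, d, p):
    """Sum of adjusted f-weights f(i) / p(i, r) over applicable keys that satisfy d.

    applicable: iterable of the keys in S*(r)
    f, d, p:    callables giving f(i), the predicate d(i), and p(i, r) > 0
    """
    total = 0.0
    for i in applicable:
        if d(i):
            total += f(i) / p(i)   # adjusted weight; keys outside S*(r) contribute 0
    return total

# Toy values (hypothetical): two applicable keys with known f and p.
f_values = {"a": 10.0, "b": 4.0}
p_values = {"a": 0.5, "b": 1.0}
print(generic_estimate(["a", "b"], f_values.get, lambda i: True, p_values.get))
```

The same adjusted weights serve any predicate: passing a different `d` reuses `f` and `p` unchanged.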

The critical ingredient in achieving the performance of our tailored
estimators is using the most inclusive suitable mapping $S^*$.

Lemma 5.1. Consider two mappings $S^*_1$ and $S^*_2$ such that for
any rank assignment $r \in \Omega$, $S^*_1(r) \subseteq S^*_2(r)$. Let $a^{(f)}_1$ and $a^{(f)}_2$ be
the corresponding adjusted weights. Then for all $i \in I$,
$\mathrm{VAR}[a^{(f)}_1(i)] \geq \mathrm{VAR}[a^{(f)}_2(i)]$.

Proof. Let $p_1(i,r)$ and $p_2(i,r)$ be the corresponding inclusion
probabilities.

It suffices to establish the relation for a particular $\Omega(i,r)$ (since
the projection of $r$ on $I \setminus \{i\}$ is a partition of $\Omega$ and the adjusted
weights are unbiased in each partition). We have
$\mathrm{VAR}_{\Omega(i,r)}[a^{(f)}_h(i)] = f(i)^2 (1/p_h(i,r) - 1)$ for $h = 1, 2$
(variance of the HT estimator on $\Omega(i,r)$).

To conclude the proof, recall from the definition that for all $i \in I$
and $r \in \Omega$, $p_1(i,r) \leq p_2(i,r)$. □

When the adjusted weights $a^{(f)}_h$ ($h = 1, 2$) have zero covariances
(this is the case for the tailored derivations), we have the
stronger property that for every $J \subseteq I$, selected by a predicate $d$,
$\mathrm{VAR}[a^{(f)}_1(J)] \geq \mathrm{VAR}[a^{(f)}_2(J)]$ where $a^{(f)}_h(J) = \sum_{i \in J} a^{(f)}_h(i)$.

6. COLOCATED WEIGHTS 

We specialize the generic estimator (Section 5) to summaries of
data sets with colocated weights. The summary $S(r)$ contains all
keys $i \in I$ such that for at least one $b \in \mathcal{W}$, $r^{(b)}(i) < r^{(b)}_{k+1}(I)$, and
the full weight vector $w^{(\mathcal{W})}(i)$ for each included key. Hence, any
$f$ and $d$ can be evaluated for all $i \in S(r)$.

We use the generic estimator with $S^*(r) = S(r)$ and refer to
this as inclusive estimators. (We use the term inclusive since they
use all keys in the union of the order-$k$ samples.) Inclusive estimators
are applicable when $f$ and $d$ satisfy the condition
$f(i)d(i) > 0 \Rightarrow w^{(\max\mathcal{W})}(i) > 0$ for all $i \in I$, which simply means that
any key with a positive contribution to the aggregate has a positive
probability of being sampled. The probability that $i$ is included in
$S(r')$ for $r' \in \Omega(i,r)$ is
\[ p(i,r) = \Pr[\exists b \in \mathcal{W},\ r'^{(b)}(i) < r^{(b)}_k(I \setminus \{i\}) \mid r' \in \Omega(i,r)] . \tag{3} \]

To compute $p(i,r)$, the summary should include, for each $b \in \mathcal{W}$,
the rank values $r^{(b)}_k(I)$ and $r^{(b)}_{k+1}(I)$, and for each $i \in S(r)$ and $b \in \mathcal{W}$,
whether $i$ is included in the order-$k$ sketch of $b$ (that is, whether
$r^{(b)}(i) < r^{(b)}_{k+1}(I)$). This information allows us to determine the
values $r^{(b)}_k(I \setminus \{i\})$ for all $i \in I$ and $b \in \mathcal{W}$: if $i$ is included in the
sketch for $b$ then $r^{(b)}_k(I \setminus \{i\}) = r^{(b)}_{k+1}(I)$. Otherwise, it is $r^{(b)}_k(I)$.
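
The bookkeeping above, together with the inclusion probabilities of Eq. (4) (independent ranks) and Eq. (5) (shared-seed ranks), can be sketched in Python for IPPS ranks, where $F_w(x) = \min\{1, wx\}$. All function names and the numeric inputs below are hypothetical and serve only as an illustration.

```python
def thresholds(in_sketch, r_k, r_k1):
    """r_k^(b)(I minus {i}) per b: r_{k+1}^(b)(I) if i is in the sketch of b, else r_k^(b)(I)."""
    return {b: (r_k1[b] if in_sketch[b] else r_k[b]) for b in r_k}

def ipps_cdf(w, x):
    """F_w(x) for IPPS ranks (rank = u / w with u uniform on [0, 1]); F_0 is 0."""
    return min(1.0, w * x) if w > 0 else 0.0

def p_independent(weights, tau, cdf=ipps_cdf):
    """Probability of inclusion in at least one sketch when ranks are independent."""
    miss = 1.0
    for b, w in weights.items():
        miss *= 1.0 - cdf(w, tau[b])
    return 1.0 - miss

def p_shared_seed(weights, tau, cdf=ipps_cdf):
    """Probability of inclusion in at least one sketch when all ranks share one seed."""
    return max(cdf(w, tau[b]) for b, w in weights.items())

# Hypothetical key with weights 15 and 20 that is in the sketch of b=1 but not b=2.
tau = thresholds({1: True, 2: False}, r_k={1: 0.0147, 2: 0.037}, r_k1={1: 0.037, 2: 0.046})
weights = {1: 15, 2: 20}
print(p_shared_seed(weights, tau), p_independent(weights, tau))
```

The independent-rank probability is never smaller than the shared-seed one, which matches the observation that coordination concentrates the sample on fewer distinct keys.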
    key, weight              Σ_i w(1)(i)    Σ_i w(2)(i)    Σ_i w(max)(i)   Σ_i w(min)(i)   Σ_i w(L1)(i)
    destIP, 4tuple           5.42 x 10^?    5.54 x 10^?    7.47 x 10^?     3.49 x 10^?     3.98 x 10^?
    destIP, bytes            2.08 x 10^?    2.17 x 10^?    3.26 x 10^?     9.96 x 10^?     2.26 x 10^?
    srcIP+destIP, packets    4.61 x 10^?    4.61 x 10^?    7.61 x 10^?     1.61 x 10^?     6.00 x 10^?
    srcIP+destIP, bytes      2.08 x 10^?    2.17 x 10^?    3.49 x 10^?     7.65 x 10^?     2.72 x 10^?

Table 1: IP dataset1

    months   distinct movies (x 10^4)   ratings (x 10^6)   min (x 10^6)   max (x 10^6)   L1 (x 10^6)
    1        1.54                       4.70
    2        1.58                       4.10
    3        1.61                       4.31
    4        1.64                       4.16
    5        1.66                       4.39
    6        1.68                       5.30
    7        1.70                       4.95
    8        1.73                       5.26
    9        1.73                       4.91
    10       1.77                       5.16
    11       1.73                       3.61
    12       1.73                       2.41
    1,2      1.60                       8.80               3.72           5.08           1.35
    1-6      1.71                       27.0               2.97           6.79           3.82
    1-12     1.77                       53.3               1.68           7.95           6.27

Table 2: Netflix data set. Distinct movies (number of movies with at least one rating) and total number of ratings for each month
(1, ..., 12) in 2005 and for periods $\mathcal{R} = \{1, 2\}$, $\mathcal{R} = \{1, \ldots, 6\}$, and $\mathcal{R} = \{1, \ldots, 12\}$. For these periods, we also show
$\sum_i w^{(\min\mathcal{R})}(i)$, $\sum_i w^{(\max\mathcal{R})}(i)$, and $\sum_i w^{(L_1\mathcal{R})}(i)$.

We provide explicit expressions for $p(i,r)$ (Eq. (3)), for $i \in S(r)$,
for the rank distributions which we consider. Since we can
evaluate $p(i,r)$, $f(i)$, and $d(i)$ for all $i \in S(r)$, we can indeed
apply the generic estimator (Section 5) with $S^*(r) = S(r)$.

Independent ranks (independent order-$k$ sketches): The probability
over $\Omega(i,r)$ that $i$ is included in the order-$k$ sketch of $b$ is
$F_{w^{(b)}(i)}(r^{(b)}_k(I \setminus \{i\}))$. It is included in $S(r')$ if and only if it is
included for at least one of $b \in \mathcal{W}$. Since $r^{(b)}(i)$ are independent,
\[ p(i,r) = 1 - \prod_{b \in \mathcal{W}} \bigl(1 - F_{w^{(b)}(i)}(r^{(b)}_k(I \setminus \{i\}))\bigr) . \tag{4} \]
For EXP ranks: $p(i,r) = 1 - \prod_{b \in \mathcal{W}} \exp(-w^{(b)}(i)\, r^{(b)}_k(I \setminus \{i\}))$
and for IPPS ranks, $p(i,r) = 1 - \prod_{b \in \mathcal{W}} (1 - \min\{1, w^{(b)}(i)\, r^{(b)}_k(I \setminus \{i\})\})$.

Shared-seed consistent ranks (coordinated order-$k$ sketches): $i$ is
included in the sketch of $b$ for $r' \in \Omega(i,r)$ if and only if
$u(i) < F_{w^{(b)}(i)}(r^{(b)}_k(I \setminus \{i\}))$. The probability that it is included for at
least one of $b \in \mathcal{W}$ is
\[ p(i,r) = \max_{b \in \mathcal{W}} \{F_{w^{(b)}(i)}(r^{(b)}_k(I \setminus \{i\}))\} . \tag{5} \]
For EXP ranks:
$p(i,r) = 1 - \exp(-\max_{b \in \mathcal{W}} \{w^{(b)}(i)\, r^{(b)}_k(I \setminus \{i\})\})$ and for IPPS
ranks: $p(i,r) = \min\{1, \max_{b \in \mathcal{W}} \{w^{(b)}(i)\, r^{(b)}_k(I \setminus \{i\})\}\}$.

Independent-differences consistent ranks (coordinated order-$k$
sketches): Let $w^{(b_1)}(i) \leq \cdots \leq w^{(b_h)}(i)$ be the entries of the
weight vector of $i$. Recall that $r^{(b_j)}(i) \leftarrow \min_{a \leq j} d_a$ where
$d_j \leftarrow \mathrm{EXP}[w^{(b_j)}(i) - w^{(b_{j-1})}(i)]$ (we define $w^{(b_0)}(i) \equiv 0$ and
$F_0 \equiv 0$ for $\mathrm{EXP}[0]$).

We also define $M_\ell = \max_{j \geq \ell} r^{(b_j)}_k(I \setminus \{i\})$ ($\ell \in [h]$), and
the event $A_j$ to consist of all rank assignments such that $j$ is the
smallest index for which $d_j < M_j$. Clearly the events $A_j$ are
disjoint and $p(i,r) = \sum_{\ell=1}^{h} \Pr[A_\ell]$.

The probabilities $\Pr[A_\ell]$ can be computed using a linear pass
on the sorted weight vector of $i$ using the independence of the $d_\ell$'s as
follows:
\[ \Pr[A_1] = \Pr[d_1 < M_1] = F_{w^{(b_1)}(i)}(M_1) ; \]
\[ \Pr[A_2] = \Pr[d_1 \geq M_1 \wedge d_2 < M_2] = \bigl(1 - F_{w^{(b_1)}(i)}(M_1)\bigr)\, F_{w^{(b_2)}(i) - w^{(b_1)}(i)}(M_2) ; \]
\[ \Pr[A_\ell] = \Pr\Bigl[\bigwedge_{a=1}^{\ell-1} (d_a \geq M_a) \wedge d_\ell < M_\ell\Bigr]
   = \prod_{a=1}^{\ell-1} \bigl(1 - F_{w^{(b_a)}(i) - w^{(b_{a-1})}(i)}(M_a)\bigr)\, F_{w^{(b_\ell)}(i) - w^{(b_{\ell-1})}(i)}(M_\ell) . \]

Generic consistent rank assignments (coordinated sketches):
Let $\mathcal{R} \subseteq \mathcal{W}$ be the set of assignments relevant for $f$ and $d$. Let
$r^{(\min\mathcal{R})}_k(I \setminus \{i\}) = \min_{b \in \mathcal{R}} r^{(b)}_k(I \setminus \{i\})$ and
\[ S^*(r) = \{ i \mid \min_{b \in \mathcal{R}} r^{(b)}(i) < r^{(\min\mathcal{R})}_k(I \setminus \{i\}) \} . \tag{6} \]
For all consistent rank assignments, the inclusion probability of $i$
in $S^*(r')$ over $r' \in \Omega(i,r)$ is
\[ p(i,r) = F_{w^{(\max\mathcal{R})}(i)}(r^{(\min\mathcal{R})}_k(I \setminus \{i\})) . \]
It is easy to see that $S^*(r)$ satisfies the requirements of the generic
estimator (Section 5). In contrast, the use of $S^*(r) = S(r)$ and
Eq. (3) required derivations for specific consistent rank distributions.
A caveat of this estimator (consequence of Lemma 5.1) is
that its variance is always at least that of a respective tailored estimator.

7. DISPERSED WEIGHTS

Let $r$ be a rank assignment for $(I, \mathcal{W})$. The summary $S(r)$ is
the set of order-$k$ sketches $S_k(I, r^{(b)})$ for $b \in \mathcal{W}$. In the dispersed
weights model $w^{(b)}(i)$ (for $i \in I$, $b \in \mathcal{W}$) is included in $S(r)$ if
and only if $i \in S_k(I, r^{(b)})$.

For $\mathcal{R} \subseteq \mathcal{W}$ and $i \in I$, let $w^{(\max\mathcal{R})}(i) = \max_{b \in \mathcal{R}} w^{(b)}(i)$,
$b^{(\max\mathcal{R})}(i) = \arg\max_{b \in \mathcal{R}} w^{(b)}(i)$ (the weight assignment from
$\mathcal{R}$ which maximizes $i$'s weight), and $r^{(\min\mathcal{R})}(i) = \min_{b \in \mathcal{R}} r^{(b)}(i)$
(the smallest rank value that $i$ assumes for $b \in \mathcal{R}$). If $r$ is consistent
then $r^{(\min\mathcal{R})}(i) = r^{(b^{(\max\mathcal{R})}(i))}(i)$ (smallest rank value for
$i$ is assumed on the assignment with largest weight). Similarly,
$w^{(\min\mathcal{R})}(i) = \min_{b \in \mathcal{R}} w^{(b)}(i)$, $b^{(\min\mathcal{R})}(i) = \arg\min_{b \in \mathcal{R}} w^{(b)}(i)$,
and $r^{(\max\mathcal{R})}(i) = \max_{b \in \mathcal{R}} r^{(b)}(i)$. When the dependency on $\mathcal{R}$ is
clear from context, it is omitted.

    day    open   high   low    close   adj_close   volume
    1      1.81   1.85   1.78   1.82    1.81        1.52
    2      1.80   1.83   1.73   1.75    1.74        1.66
    3      1.75   1.81   1.70   1.72    1.72        1.82
    4      1.68   1.72   1.57   1.65    1.64        2.26
    5      1.65   1.70   1.57   1.59    1.58        1.96
    6      1.55   1.63   1.50   1.56    ?           2.44
    7      1.56   1.61   1.45   1.48    1.47        2.10
    8      1.42   1.54   1.33   1.46    1.45        3.14
    9      1.50   1.61   1.46   1.58    1.57        1.93
    10     1.61   1.67   1.52   1.57    1.56        2.22
    11     1.54   1.57   1.45   1.47    1.46        1.80
    12     1.47   1.53   1.40   1.50    1.49        2.27
    13     1.48   1.57   1.44   1.50    1.50        1.84
    14     1.52   1.57   1.49   ?       1.54        1.42
    15     1.52   1.56   1.49   1.51    1.51        1.43
    16     1.48   1.52   1.42   1.45    1.44        1.73
    17     1.45   1.49   1.38   1.44    1.43        2.05
    18     1.37   1.44   1.34   1.40    1.39        1.84
    19     1.38   1.43   1.34   1.36    1.36        1.55
    20     1.38   1.45   1.33   1.42    1.42        1.99
    21     1.42   1.49   1.39   1.44    1.43        1.96
    22     1.46   1.50   1.42   1.48    1.47        1.71
    23     1.47   1.54   1.44   1.51    1.50        1.75

Table 3: Daily totals for 23 trading days in October, 2008. Prices (open, high, low, close, adjusted_close) are x 10^?. Volumes are
in x 10^?.

We also use $r^{(\min\mathcal{R})}_{k+1}(I) \equiv \min_{b \in \mathcal{R}} r^{(b)}_{k+1}(I)$ and denote the
weight and rank vectors of $i \in I$ by $w^{(\mathcal{R})}(i)$ and $r^{(\mathcal{R})}(i)$.

We apply the generic derivation (Section 5) using the following
guidelines:

(1) If $f$ can be expressed as a linear combination of the form
$f(i) = f_1(i) + f_2(i) + \cdots$, we estimate each summand $f_j$ separately.
This allows for weaker conditions in the generic derivation,
resulting in more inclusive sets of applicable samples and tighter
estimates.

(2) We determine a set $\mathcal{R} \subseteq \mathcal{W}$ of relevant assignments for $f$ and
$d$. In the dispersed weights model, samples taken for assignments
not in $\mathcal{R}$ do not contain any useful information. The set $S^*(r)$ of
applicable samples is a subset of $\bigcup_{b \in \mathcal{R}} S_k(I, r^{(b)})$.

(3) We consider the dependence of $f$ and $d$ on the weight vector
$w^{(\mathcal{R})}(i)$. We derive estimators for two families of $f$ and $d$'s that include
the cases where $f$ is $w^{(\min\mathcal{R})}$, $w^{(\max\mathcal{R})}$, or $w^{(L_1\mathcal{R})}$ which
we used in our empirical evaluation. Our methodology is applicable
to other interesting $f$'s such as quantiles over a set $\mathcal{R}$ of assignments.
For example we can estimate the sum of the medians of the
weights $w^{(1)}(i)$, $w^{(2)}(i)$, ..., over all items $i$.

We say that / and d are min-dependent if 



It is easy to see that f{i) — and any predicate d are 

min-dependent, but f{i) — ^''"^''^■'(i) and any d which selects 
items i for which > is not. We derive estimators 

for all min-dependent /, d for both coordinated and independent 
sketches. 

We say that / and d are max-dependent if 

In particular, $f(i) = w^{(\max\mathcal{R})}(i)$ and any attribute-based predicate
$d$ are max-dependent. We derive estimators for max-dependent $f$
and $d$ for coordinated sketches. We also argue that it is not possible
to obtain unbiased nonnegative estimates for $f(i) = w^{(\max\mathcal{R})}(i)$
over independent sketches.



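7.1 Max-dependence

Max-dependence estimator (coordinated sketches):

  • $S^*(r) \leftarrow \{ i \mid \exists b \in \mathcal{R},\ r^{(b)}(i) < r^{(\min\mathcal{R})}_{k+1}(I) \}$

  • For $i \in S^*(r)$:
      $w^{(\max\mathcal{R})}(i) \leftarrow \max\{ w^{(b)}(i) \mid b \in \mathcal{R},\ i \in S_k(I, r^{(b)}) \}$
      $b^{(\max\mathcal{R})}(i) \leftarrow \arg\max_{b \in \mathcal{R} \mid i \in S_k(I, r^{(b)})} w^{(b)}(i)$
      $p(i,r) \leftarrow F_{w^{(\max\mathcal{R})}(i)}(r^{(\min\mathcal{R})}_{k+1}(I))$
      $a_f(i) \leftarrow f(b^{(\max\mathcal{R})}(i), w^{(\max\mathcal{R})}(i)) / p(i,r)$

  • Output $\sum_{i \in S^*(r) \mid d(i)} a_f(i)$

As a special case for $f(i) = w^{(\max\mathcal{R})}(i)$ and $i \in S^*(r)$ we
obtain the adjusted weights:
\[ a^{(\max\mathcal{R})}(i) = \frac{w^{(\max\mathcal{R})}(i)}{F_{w^{(\max\mathcal{R})}(i)}(r^{(\min\mathcal{R})}_{k+1}(I))} . \]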
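
The max-dependence estimator can be sketched in Python for shared-seed IPPS ranks, where $F_w(x) = \min\{1, wx\}$. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names and the dict-based layout are hypothetical, and the full weight table is passed in only to build the sketches, while the estimate itself reads a weight only when the key is in the corresponding sketch (the dispersed model).

```python
import math

def order_k(ranks, k):
    """Return (sketch, r_{k+1}): the k smallest-ranked keys and the (k+1)-st smallest rank."""
    ordered = sorted((r, i) for i, r in ranks.items())
    sketch = {i for r, i in ordered[:k] if r < math.inf}
    r_k1 = ordered[k][0] if len(ordered) > k else math.inf
    return sketch, r_k1

def max_dependence_estimate(weights, seeds, k, f=lambda b, w: w, d=lambda i: True):
    """Estimate the sum of f(b_max(i), w_max(i)) over keys with d(i), IPPS shared-seed ranks.

    weights: {b: {key: weight}} for the relevant assignments (every key present, 0 allowed)
    seeds:   {key: u(i)} with u(i) in (0, 1)
    """
    sketches, r_k1 = {}, {}
    for b, w in weights.items():
        ranks = {i: (seeds[i] / w[i] if w[i] > 0 else math.inf) for i in seeds}
        sketches[b], r_k1[b] = order_k(ranks, k)
    tau = min(r_k1.values())                       # minimum over b of r_{k+1}^(b)(I)
    total = 0.0
    for i in set().union(*sketches.values()):
        # Only weights visible in the dispersed summary: sketches that contain i.
        seen = {b: weights[b][i] for b in weights if i in sketches[b]}
        b_max = max(seen, key=seen.get)
        w_max = seen[b_max]
        if seeds[i] / w_max < tau and d(i):        # membership in S*(r)
            p = min(1.0, w_max * tau)              # F_{w_max}(tau) for IPPS ranks
            total += f(b_max, w_max) / p
    return total

# Seeds and weight assignments w(1), w(3) of the example data set in Figure 2.
u = {"i1": 0.22, "i2": 0.75, "i3": 0.07, "i4": 0.92, "i5": 0.55, "i6": 0.37}
w = {1: {"i1": 15, "i2": 0, "i3": 10, "i4": 5, "i5": 10, "i6": 10},
     3: {"i1": 10, "i2": 15, "i3": 15, "i4": 0, "i5": 15, "i6": 10}}
print(max_dependence_estimate(w, u, k=3))   # estimate of the sum of max weights (true value 75)
```

With $k$ at least the number of keys every inclusion probability is 1 and the estimate equals the exact sum.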



Correctness: 

Lemma 7.1. Let $r$ be a consistent rank assignment. (i) $S^*(r)$
is the set of $|S^*(r)|$ least-ranked keys with respect to $r^{(\min\mathcal{R})}(i)$.
(ii) For each $i \in S^*(r)$, the computation of $b^{(\max\mathcal{R})}(i)$, $w^{(\max\mathcal{R})}(i)$
as shown in the box above is correct. (iii) $|S^*(r)| \geq k$.

Proof. (iii): Since $S^*(r)$ contains all the keys in at least one
of the order-$k$ sketches $S_k(I, r^{(b)})$ (for $b$ with minimum $r^{(b)}_{k+1}(I)$),
we have $|S^*(r)| \geq k$.

(ii): Consider $i \in S^*(r)$ and let $b = b^{(\max\mathcal{R})}(i)$. We show
that $i \in S_k(I, r^{(b)})$. This would immediately imply that the computation
of $w^{(\max\mathcal{R})}(i)$ and $b^{(\max\mathcal{R})}(i)$ is correct. The computation of
$p(i,r)$ is correct by consistency of the ranks.

By the definition of $S^*(r)$ we know that there exists a $b' \in \mathcal{R}$
such that $r^{(b')}(i) < r^{(\min\mathcal{R})}_{k+1}(I)$. From consistency follows that
$r^{(b)}(i) \leq r^{(b')}(i)$, and from the definition of $r^{(\min\mathcal{R})}_{k+1}(I)$ follows
that $r^{(\min\mathcal{R})}_{k+1}(I) \leq r^{(b)}_{k+1}(I)$. Thus we get that $r^{(b)}(i) < r^{(b)}_{k+1}(I)$
which means that $i \in S_k(I, r^{(b)})$.

(i): A key $i$ is included in $S^*(r)$ if and only if $r^{(\min\mathcal{R})}(i) < r^{(\min\mathcal{R})}_{k+1}(I)$.
So if $i \in S^*(r)$ and $r^{(\min\mathcal{R})}(j) < r^{(\min\mathcal{R})}(i)$ then
$j \in S^*(r)$. □

From Lemma 7.1 (Property (ii)), for all $i \in S^*(r)$, we can determine
$b^{(\max\mathcal{R})}(i)$ and $w^{(\max\mathcal{R})}(i)$ from $S(r)$ and therefore evaluate
$f(i)$ and $d(i)$.

From consistency of $r$, for $i \in I$ we have
$p(i,r) = \Pr[r'^{(\min\mathcal{R})}(i) < r^{(\min\mathcal{R})}_{k+1}(I) \mid r' \in \Omega(i,r)] = F_{w^{(\max\mathcal{R})}(i)}(r^{(\min\mathcal{R})}_{k+1}(I))$. Hence,
$p(i,r)$ can be evaluated for all $i \in S^*(r)$ and $S^*(r)$ fills the requirements
of the generic estimator (Section 5).

Observe that the generic estimator is not applicable for $w^{(\max\mathcal{R})}(i)$
for independent sketches: we can evaluate $w^{(\max\mathcal{R})}(i)$ only if $i$
is included in all order-$k$ sketches $S_k(I, r^{(b)})$, $b \in \mathcal{R}$ (if $i$ is not
included we cannot be certain that we see the maximum weight
occurrence of $i$). On the other hand, if $w^{(\min\mathcal{R})}(i) = 0$, then $i$ is
included in all order-$k$ sketches $S_k(I, r^{(b)})$, $b \in \mathcal{R}$ with zero probability.
In fact, there is no "well-behaved" (nonnegative) estimator
for $w^{(\max\mathcal{R})}(i)$ for independent sketches.

7.2 Min-dependence

Min-dependence l-set estimator:

  • $S^*_l(r) \leftarrow \{ i \mid \bigwedge_{b \in \mathcal{R}} r^{(b)}(i) < r^{(b)}_{k+1}(I) \}$

  • $\forall i \in S^*_l(r)$,
    $p_l(i,r) \leftarrow \Pr[\forall b \in \mathcal{R},\ r'^{(b)}(i) < r^{(b)}_{k+1}(I) \mid r' \in \Omega(i,r)]$

$S^*_l(r)$ is the set of keys that are included in all $|\mathcal{R}|$ order-$k$
sketches.

$p_l(i,r)$ for shared-seed consistent ranks is:
\[ p_l(i,r) = \min_{b \in \mathcal{R}} F_{w^{(b)}(i)}(r^{(b)}_{k+1}(I)) . \tag{8} \]
For EXP ranks, $p_l(i,r) = 1 - \exp(-\min_{b \in \mathcal{R}} w^{(b)}(i)\, r^{(b)}_{k+1}(I))$
and for IPPS ranks, $p_l(i,r) = \min\{1, \min_{b \in \mathcal{R}} \{w^{(b)}(i)\, r^{(b)}_{k+1}(I)\}\}$.
For independent-differences consistent ranks, $p_l(i,r)$ is expressed
as a simultaneous bound on all prefix-sums of a set of independent
exponentially-distributed random variables. We omit the details.
For independent ranks:
\[ p_l(i,r) = \prod_{b \in \mathcal{R}} F_{w^{(b)}(i)}(r^{(b)}_{k+1}(I)) . \tag{9} \]

By contrasting (8) and (9) we can see that the respective inclusion
probability can be exponentially smaller (in $|\mathcal{R}|$) for independent
sketches than with coordinated sketches. Since the variance
$\mathrm{VAR}[a(i)]$ is proportional to $(1/p(i,r) - 1)$, we can have exponentially
larger variance.

Let a*"""'^^(i) be the adjusted weight for f{i) = «;'™'"'^)(i) 
of the l-set estimator using shared-seed consistent ranks, and let 
a<™j"^>(i) be the adjusted weight for f{i) = u)(™'"^)(i) of the 
l-set estimator using independent ranks. 

We can also use a smaller set of samples as follows.

Min-dependence s-set estimator:

  • $S^*_s(r) \leftarrow \{ i \mid \bigwedge_{b \in \mathcal{R}} r^{(b)}(i) < r^{(\min\mathcal{R})}_{k+1}(I) \}$

  • $\forall i \in S^*_s(r)$,
    $p_s(i,r) \leftarrow \Pr[\forall b \in \mathcal{R},\ r'^{(b)}(i) < r^{(\min\mathcal{R})}_{k+1}(I) \mid r' \in \Omega(i,r)]$

$S^*_s(r)$ is the set of keys that are included in all $|\mathcal{R}|$ sketches with
rank value at most $r^{(\min\mathcal{R})}_{k+1}(I)$. The advantage of the s-set estimator
is that for coordinated sketches the inclusion probabilities have a
simpler formula which is easier to compute, namely
\[ p_s(i,r) = F_{w^{(\min\mathcal{R})}(i)}(r^{(\min\mathcal{R})}_{k+1}(I)) . \]

Lemma 7.2. These inclusion probabilities for the s-set estimator
are correct.

Proof. Let $r$ be a consistent rank assignment. We have that
for all $b \in \mathcal{R}$, $r^{(b)}(i) < r^{(\min\mathcal{R})}_{k+1}(I)$ if and only if
$r^{(\max\mathcal{R})}(i) < r^{(\min\mathcal{R})}_{k+1}(I)$. Therefore,
\[ p_s(i,r) = \Pr[r'^{(\max\mathcal{R})}(i) < r^{(\min\mathcal{R})}_{k+1}(I) \mid r' \in \Omega(i,r)] = F_{w^{(\min\mathcal{R})}(i)}(r^{(\min\mathcal{R})}_{k+1}(I)) . \]
□

The s-set estimator can be used with independent ranks but there
is no advantage in doing so.

As a special case, we obtain adjusted weights for $f(i) = w^{(\min\mathcal{R})}(i)$
by
\[ a^{(\min\mathcal{R})}_s(i) = \frac{w^{(\min\mathcal{R})}(i)}{F_{w^{(\min\mathcal{R})}(i)}(r^{(\min\mathcal{R})}_{k+1}(I))} \tag{10} \]
for every $i \in S_s(r)$, and $a^{(\min\mathcal{R})}_s(i) = 0$ otherwise.

Correctness: It is easy to see that with both consistent and independent
ranks, any $i$ with $f(i)d(i) > 0$ has nonzero probability
to be included in $S_s(r)$ and $S_l(r)$. Furthermore, for all included
$i$, the full weight vector $w^{(\mathcal{R})}(i)$ is available from $S(r)$ and therefore
$f$ and $d$ can be evaluated. Therefore the s-set and l-set
min-dependence estimators satisfy the requirements of the generic
estimator (Section 5).

s-set versus l-set estimators. The l-set estimators have lower variance
than the s-set estimators:

Lemma 7.3. For any weight function $f$ and $i \in I$,
\[ \mathrm{VAR}[a^{(f)}_l(i)] \leq \mathrm{VAR}[a^{(f)}_s(i)] . \]

Proof. Since $S^*_s(r) \subseteq S^*_l(r)$, it follows from Lemma 5.1 that
the l-set estimator has at most the variance of the s-set estimator
(since $p_s(i,r) \leq p_l(i,r)$). □

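7.3 $L_1$ difference

For a consistent $r$, we define the $w^{(L_1\mathcal{R})}$ adjusted weights
\[ a^{(L_1\mathcal{R})}_s(i) = a^{(\max\mathcal{R})}(i) - a^{(\min\mathcal{R})}_s(i) \tag{11} \]
\[ a^{(L_1\mathcal{R})}_l(i) = a^{(\max\mathcal{R})}(i) - a^{(\min\mathcal{R})}_l(i) . \tag{12} \]

We use the notation $p^{(\max\mathcal{R})}(i,r)$, $p^{(\min\mathcal{R})}_l(i,r)$, and $p^{(\min\mathcal{R})}_s(i,r)$
for the respective inclusion probabilities. We use the notation $a^{(\min\mathcal{R})}$,
$a^{(L_1\mathcal{R})}$, $p^{(\min\mathcal{R})}$ when the statement applies to both the respective
s-set and l-set estimators.

We show that for coordinated sketches, our $w^{(L_1\mathcal{R})}$ adjusted
weights are "well behaved," in the sense that they are nonnegative.
We first establish the following lemma.

Lemma 7.4. For consistent $r$ with IPPS or EXP ranks, $\mathcal{R}$, $k$,
and $i \in I$,
\[ \frac{p^{(\max\mathcal{R})}(i,r)}{p^{(\min\mathcal{R})}(i,r)} \leq \frac{w^{(\max\mathcal{R})}(i)}{w^{(\min\mathcal{R})}(i)} . \]

Proof. Since $p^{(\min\mathcal{R})}_s(i,r) \leq p^{(\min\mathcal{R})}_l(i,r)$, it suffices to establish
the inequality for $p^{(\min\mathcal{R})}_s(i,r)$.

For IPPS ranks it suffices to show that for any $\tau$
\[ \frac{\min\{1, \tau w^{(\max\mathcal{R})}(i)\}}{\min\{1, \tau w^{(\min\mathcal{R})}(i)\}} \leq \frac{w^{(\max\mathcal{R})}(i)}{w^{(\min\mathcal{R})}(i)} . \]
This is clear in case the numerator of the left hand side is $\tau w^{(\max\mathcal{R})}(i)$
and the denominator is $\tau w^{(\min\mathcal{R})}(i)$. Otherwise, since $\tau w^{(\max\mathcal{R})}(i) \geq \tau w^{(\min\mathcal{R})}(i)$,
the numerator is $1 \leq \tau w^{(\max\mathcal{R})}(i)$ so the inequality
must also hold.

For EXP ranks, we need to show that for any $\tau$,
\[ \frac{1 - \exp(-\tau w^{(\max\mathcal{R})}(i))}{1 - \exp(-\tau w^{(\min\mathcal{R})}(i))} \leq \frac{w^{(\max\mathcal{R})}(i)}{w^{(\min\mathcal{R})}(i)} . \]
Taking $x = w^{(\max\mathcal{R})}(i)/w^{(\min\mathcal{R})}(i)$ and $\gamma = \exp(-\tau w^{(\min\mathcal{R})}(i))$,
this follows since for any $x \geq 1$ and $0 < \gamma < 1$, $\frac{1 - \gamma^x}{1 - \gamma} \leq x$. □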
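
The final inequality used for EXP ranks can be checked with Bernoulli's inequality. The following short derivation is a sketch added for illustration; the symbols $x$ (the ratio of the maximum to the minimum weight, so $x \geq 1$) and $\gamma$ (the value $\exp(-\tau w^{(\min\mathcal{R})}(i))$, so $0 < \gamma < 1$) are names chosen here.

```latex
\[
\gamma^{x} = \bigl(1 + (\gamma - 1)\bigr)^{x} \;\ge\; 1 + x(\gamma - 1)
\qquad (x \ge 1,\ \gamma - 1 \ge -1),
\]
\[
\text{hence}\quad 1 - \gamma^{x} \;\le\; x\,(1 - \gamma)
\quad\text{and, since } 1 - \gamma > 0,\quad
\frac{1 - \gamma^{x}}{1 - \gamma} \;\le\; x .
\]
```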



Lemma 7.5. For consistent $r$ with IPPS or EXP ranks, $\forall i \in I$,
$a^{(L_1\mathcal{R})}(i) \geq 0$.

Proof. It suffices to show that $a^{(\min\mathcal{R})}(i) \leq a^{(\max\mathcal{R})}(i)$. We
first observe that $a^{(\min\mathcal{R})}(i) > 0$ implies $a^{(\max\mathcal{R})}(i) > 0$. If
$a^{(\max\mathcal{R})}(i) > 0$ and $a^{(\min\mathcal{R})}(i) = 0$ we are done. Otherwise, the
claim follows using Lemma 7.4. □
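
The $L_1$ adjusted weights and their nonnegativity can be illustrated with a Python sketch for shared-seed IPPS ranks, using the s-set estimator for the minimum. The function names and data layout are assumptions for illustration. For IPPS ranks, $w / \min\{1, w\tau\}$ simplifies to $\max\{w, 1/\tau\}$, which makes the comparison between the max and min adjusted weights explicit.

```python
import math

def adjusted(w, tau):
    """IPPS adjusted weight w / min(1, w * tau), written as max(w, 1 / tau)."""
    return max(w, 1.0 / tau)

def l1_adjusted_weights(weights, seeds, k):
    """Per-key adjusted weights (a_max, a_min, a_L1) from coordinated order-k sketches.

    Assumes IPPS ranks with shared seeds; a_min uses the s-set estimator.
    weights: {b: {key: weight}} for the relevant assignments; seeds: {key: u(i)}.
    """
    ranks = {b: {i: (seeds[i] / w[i] if w[i] > 0 else math.inf) for i in seeds}
             for b, w in weights.items()}
    # r_{k+1}^(b)(I) for each assignment, then the minimum over the assignments.
    tau = min(sorted(r.values())[k] if len(r) > k else math.inf for r in ranks.values())
    result = {}
    for i in seeds:
        # A rank below tau implies membership in that sketch, so the weight is visible.
        below = [b for b in ranks if ranks[b][i] < tau]
        if not below:
            continue                                  # i is not an applicable sample
        a_max = adjusted(max(weights[b][i] for b in below), tau)
        a_min = 0.0
        if len(below) == len(ranks):                  # i is in the s-set
            a_min = adjusted(min(weights[b][i] for b in below), tau)
        result[i] = (a_max, a_min, a_max - a_min)
    return result

# Seeds and weight assignments w(1), w(3) of the example data set in Figure 2.
u = {"i1": 0.22, "i2": 0.75, "i3": 0.07, "i4": 0.92, "i5": 0.55, "i6": 0.37}
w = {1: {"i1": 15, "i2": 0, "i3": 10, "i4": 5, "i5": 10, "i6": 10},
     3: {"i1": 10, "i2": 15, "i3": 15, "i4": 0, "i5": 15, "i6": 10}}
res = l1_adjusted_weights(w, u, k=3)
print(sum(v[2] for v in res.values()))   # estimate of the L1 difference (true value 35)
```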

8. VARIANCE PROPERTIES 

8.1 Covariances 

We conjecture that the estimators we presented have zero covariances.
This conjecture is consistent with empirical observations
and with properties of related RC estimators [17, 18]. With zero
covariances, the variance $\mathrm{VAR}[a^{(f)}(J)]$ is the sum over $i \in J$ of
the per-key variances $\mathrm{VAR}[a^{(f)}(i)]$. Hence, if two adjusted-weights
estimators $a_1$ and $a_2$ have $\mathrm{VAR}[a_1(i)] \geq \mathrm{VAR}[a_2(i)]$ for all $i \in I$,
then the relation holds for all $J \subseteq I$.

Conjecture 8.1. All our estimators for colocated or dispersed
summaries have zero covariances: For all $i \neq j \in I$,
$E[a^{(f)}(i)\, a^{(f)}(j)] = f(i) f(j)$.

8.2 Variance bounds 

We use the notation $t^{(f)}(i)$ for the RC $f$-adjusted weights
assigned by an RC estimator applied to an order-$k$ sketch of $(I, f)$.
We also write $t^{(w^{(b)})}(i)$ as $t^{(b)}(i)$ for short.

We measure the variance of an adjusted weight assignment $a$
using $\Sigma V[a] = \sum_{i \in I} \mathrm{VAR}[a(i)]$. To establish a variance
relation between two estimators, it suffices to establish it for each
key $i$. Furthermore, if the estimators are defined with respect to the
same distribution of rank assignments, then it suffices to establish
the variance relation with respect to some $\Omega(i, r)$ (since these
subspaces partition $\Omega$ and our estimators are unbiased on each
subspace).

The variance of the adjusted $f$-weights $a^{(f)}(i)$ for $i \in I$ is

$$\mathrm{VAR}_{\Omega(i,r)}[a^{(f)}(i)] = f(i)^2 \left( \frac{1}{p^{(f)}(i,r)} - 1 \right) . \qquad (13)$$
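As a small worked illustration of Equation (13) and of the $\Sigma V$ measure, the sketch below evaluates the per-key variance of an adjusted weight that equals $f(i)/p$ with probability $p$ and $0$ otherwise, and sums it over keys. The function names and the toy values of $f$ and $p$ are ours, chosen only for illustration.

```python
def per_key_variance(f_i, p_i):
    """f(i)^2 * (1/p - 1): variance of an adjusted weight that is
    f(i)/p with probability p and 0 otherwise (Equation (13))."""
    if not 0.0 < p_i <= 1.0:
        raise ValueError("inclusion probability must be in (0, 1]")
    return f_i ** 2 * (1.0 / p_i - 1.0)

def sum_of_variances(f, p):
    """Sum of per-key variances; f and p are dicts with the same keys."""
    return sum(per_key_variance(f[i], p[i]) for i in f)

# Toy values (hypothetical): three keys with fixed inclusion probabilities.
f = {"a": 2.0, "b": 10.0, "c": 0.5}
p = {"a": 0.5, "b": 1.0, "c": 0.25}
print(sum_of_variances(f, p))  # 4.0 + 0.0 + 0.75 = 4.75
```

A key that is included with probability 1 contributes no variance, which is why heavy keys are cheap to estimate under weighted sampling.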

Colocated single-assignment estimators. We show that our
single-assignment inclusive estimators for colocated summaries
(independent or coordinated) dominate plain RC estimators based on a
single order-$k$ sketch.

Lemma 8.2. For $b \in \mathcal{W}$ and $i \in I$, let $a^{(b)}(i)$ be the
adjusted weights for colocated summaries computed by our estimator
(using $S^*(r) = S(r)$ and inclusion probabilities $p(i,r)$). Then,
$\mathrm{VAR}[a^{(b)}(i)] \le \mathrm{VAR}[t^{(b)}(i)]$.

Proof. Consider applying the generic estimator with $S^{**}(r)$
containing all keys $i$ with $r^{(b)}(i) < r^{(b)}_{k+1}(I)$. This estimator
assigns to $i$ an adjusted weight of $0$ if $r^{(b)}(i) > r^{(b)}_{k+1}(I)$
and an adjusted weight of $w^{(b)}(i) / F_{w^{(b)}(i)}(r^{(b)}_{k+1}(I))$
otherwise. These are the same adjusted weights as assigned by the RC
order-$k$ estimator if we apply it using the rank assignment to
$(I, w^{(b)})$ obtained by restricting $r$ (that is, using $r^{(b)}$). The
lemma now follows from Lemma 5.1.

A direct proof: It suffices to establish the variance relation for
a particular subspace $\Omega(i,r)$, considering the restriction $r^{(b)}$
of $r$ as the rank assignment for $(I, w^{(b)})$. In $\Omega(i,r)$, the
variance of $t^{(b)}(i)$ is

$$\mathrm{VAR}_{\Omega(i,r)}[t^{(b)}(i)] = w^{(b)}(i)^2 \left( \frac{1}{F_{w^{(b)}(i)}(r^{(b)}_{k+1}(I))} - 1 \right) .$$

Since $i$ is included in $S(r)$ whenever it is included in the order-$k$
sketch of $b$, $p(i,r) \ge F_{w^{(b)}(i)}(r^{(b)}_{k+1}(I))$. Therefore,

$$\mathrm{VAR}_{\Omega(i,r)}[a^{(b)}(i)] = w^{(b)}(i)^2 \left( \frac{1}{p(i,r)} - 1 \right) \le \mathrm{VAR}_{\Omega(i,r)}[t^{(b)}(i)] .$$
□
Approximation quality of multiple-assignment estimators. The
quality of the estimate depends on the relation between $f$ and the
weight assignment(s) with respect to which the weighted sampling
is performed. We refer to these assignments as primary. Variance
is minimized when $f(i)$ are the primary weights, but often $f$ must
be secondary: $f$ may not be known at the time of sampling, and the
number of different functions $f$ that are of interest can be large.
For example, to estimate all pairwise similarities we need
$\binom{|\mathcal{W}|}{2}$ different "weight assignments". For dispersed
weights, even if known a priori, weighted samples with respect to some
multiple-assignment $f$ cannot, generally, be computed in a scalable
way. We bound the variance of our min, max, and $L_1$ estimators.

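Colocated min, max, and $L_1$ estimators. We bound the variance
of inclusive estimators for min, max, and $L_1$ using the variance of
inclusive estimators for the respective primary weight assignments.

Lemma 8.3. For $f \in \{\max\mathcal{R}, \min\mathcal{R}, L_1\mathcal{R}\}$, let
$a^{(f)}(i)$ be the adjusted $w^{(f)}$-weights for colocated summaries
computed by our estimator (using $S^*(r) = S(r)$ and inclusion
probabilities $p(i,r)$). Then

$$\mathrm{VAR}[a^{(\min\mathcal{R})}(i)] = \min_{b \in \mathcal{R}} \mathrm{VAR}[a^{(b)}(i)] ,$$

$$\mathrm{VAR}[a^{(\max\mathcal{R})}(i)] = \max_{b \in \mathcal{R}} \mathrm{VAR}[a^{(b)}(i)] ,$$

$$\mathrm{VAR}[a^{(L_1\mathcal{R})}(i)] \le \mathrm{VAR}[a^{(\max\mathcal{R})}(i)] .$$

Proof. It suffices to establish this relation in a subspace
$\Omega(i,r)$. All inclusive estimators share the same inclusion
probabilities $p(i,r)$ and the variance is as in Equation (13). The
proof is immediate from the definitions, substituting
$w^{(\min\mathcal{R})}(i) = \min_{b \in \mathcal{R}} w^{(b)}(i)$,
$w^{(\max\mathcal{R})}(i) = \max_{b \in \mathcal{R}} w^{(b)}(i)$, and
$w^{(L_1\mathcal{R})}(i) = w^{(\max\mathcal{R})}(i) - w^{(\min\mathcal{R})}(i)$. □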

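The following relations are an immediate corollary of Lemma 8.3:

$$\Sigma V[a^{(\min\mathcal{R})}] \le \min_{b \in \mathcal{R}} \Sigma V[a^{(b)}] , \qquad \Sigma V[a^{(\max\mathcal{R})}] \le \max_{b \in \mathcal{R}} \Sigma V[a^{(b)}] ,$$

$$\Sigma V[a^{(L_1\mathcal{R})}] \le \Sigma V[a^{(\max\mathcal{R})}] \le \max_{b \in \mathcal{R}} \Sigma V[a^{(b)}] .$$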

Relative variance bound for max: For both the dispersed and the
colocated models, we show that the variance of the max estimator
is at most that of an estimator applied to a weighted sample taken
with max being the primary weight. More precisely,
$a^{(\max\mathcal{R})}(i)$ has at most the variance of an RC estimator
applied to an order-$k$ sketch of $(I, w^{(\max\mathcal{R})})$ (obtained with
respect to the same $f_w$, $w > 0$). Hence, the relative variance
bounds of single-assignment order-$k$ sketch estimators are
applicable [16, 17, 26].

Lemma 8.4. Let $t^{(\max\mathcal{R})}(i)$ be the adjusted weights of the RC
estimator applied to an order-$k$ sketch of $(I, w^{(\max\mathcal{R})})$. For
any $i \in I$, $\mathrm{VAR}[a^{(\max\mathcal{R})}(i)] \le \mathrm{VAR}[t^{(\max\mathcal{R})}(i)]$.

Proof. By Lemma 4.1, for consistent ranks,
$r^{(\min\mathcal{R})} = \min_{b \in \mathcal{R}} r^{(b)}$ is a valid rank
assignment for $(I, w^{(\max\mathcal{R})})$ (using the same rank
distributions). So it follows that RC adjusted weights with respect
to $w^{(\max\mathcal{R})}$ can be stated as a redundant application of the
generic algorithm with a subset $S_1(r)$ containing the $k$ least ranked
keys with respect to $r^{(\min\mathcal{R})}$.

From Lemma im the mapping S*(r) contains the [^'(r)! > k 
least ranked keys with respect to $r^{(\min\mathcal{R})}$. Hence,
$S_1(r) \subseteq S^*(r)$. Applying Lemma 5.1 we obtain that for all
$i \in I$, $\mathrm{VAR}[a^{(\max\mathcal{R})}(i)] \le \mathrm{VAR}[t^{(\max\mathcal{R})}(i)]$. □

Dispersed model min and $L_1$ estimators. We bound the absolute
variance of our $w^{(\min\mathcal{R})}$ estimator in terms of the variance
of $w^{(b)}$-estimators for $b \in \mathcal{R}$. Let $t^{(b)}$ be RC adjusted
$w^{(b)}$-weights using the order-$k$ sketch with ranks $r^{(b)}$.

Lemma 8.5. For shared-seed consistent $r$, for all $i \in I$,

$$\mathrm{VAR}[a^{(\min\mathcal{R})}(i)] \le \max_{b \in \mathcal{R}} \mathrm{VAR}[t^{(b)}(i)] .$$



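Proof. Fixing $i$ and $\Omega(i,r)$, for shared seed
$p^{(\min\mathcal{R})}(i,r) = \min_{b \in \mathcal{R}} F_{w^{(b)}(i)}(r^{(b)}_{k+1}(I))$.

Let $b'$ be such that
$p^{(\min\mathcal{R})}(i,r) = F_{w^{(b')}(i)}(r^{(b')}_{k+1}(I))$. Since
$w^{(\min\mathcal{R})}(i) \le w^{(b')}(i)$, we have that

$$\frac{w^{(\min\mathcal{R})}(i)}{p^{(\min\mathcal{R})}(i,r)} \le \frac{w^{(b')}(i)}{F_{w^{(b')}(i)}(r^{(b')}_{k+1}(I))} .$$

From this follows that

$$w^{(\min\mathcal{R})}(i)^2 \left( \frac{1}{p^{(\min\mathcal{R})}(i,r)} - 1 \right) \le w^{(b')}(i)^2 \left( \frac{1}{F_{w^{(b')}(i)}(r^{(b')}_{k+1}(I))} - 1 \right) ,$$

which is equivalent to the statement of the lemma. □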
It follows from this lemma that there exists a b £ TZ such that 

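Lemma 8.6. For consistent $r$, for all $i \in I$,

$$\mathrm{VAR}[a^{(L_1\mathcal{R})}(i)] \le \mathrm{VAR}[a^{(\min\mathcal{R})}(i)] + \mathrm{VAR}[a^{(\max\mathcal{R})}(i)] .$$

Proof. Fix $i$ and $\Omega(i,r)$. With probability $p^{(\min\mathcal{R})}(i,r)$,

$$a^{(L_1\mathcal{R})}(i) = \frac{w^{(\max\mathcal{R})}(i)}{p^{(\max\mathcal{R})}(i,r)} - \frac{w^{(\min\mathcal{R})}(i)}{p^{(\min\mathcal{R})}(i,r)} .$$

With probability $p^{(\max\mathcal{R})}(i,r) - p^{(\min\mathcal{R})}(i,r)$,

$$a^{(L_1\mathcal{R})}(i) = \frac{w^{(\max\mathcal{R})}(i)}{p^{(\max\mathcal{R})}(i,r)} ,$$

and otherwise $a^{(L_1\mathcal{R})}(i) = 0$. We have
$\mathrm{VAR}[a^{(L_1\mathcal{R})}(i)] = E[a^{(L_1\mathcal{R})}(i)^2] - (w^{(\max\mathcal{R})}(i) - w^{(\min\mathcal{R})}(i))^2$.
Substituting in the above we obtain

$$\mathrm{VAR}[a^{(L_1\mathcal{R})}(i)] = \frac{w^{(\max\mathcal{R})}(i)^2}{p^{(\max\mathcal{R})}(i,r)} + \frac{w^{(\min\mathcal{R})}(i)^2}{p^{(\min\mathcal{R})}(i,r)} - \frac{2\,w^{(\max\mathcal{R})}(i)\,w^{(\min\mathcal{R})}(i)}{p^{(\max\mathcal{R})}(i,r)} - \left( w^{(\max\mathcal{R})}(i) - w^{(\min\mathcal{R})}(i) \right)^2$$

$$= \mathrm{VAR}[a^{(\max\mathcal{R})}(i)] + \mathrm{VAR}[a^{(\min\mathcal{R})}(i)] - 2\,w^{(\max\mathcal{R})}(i)\,w^{(\min\mathcal{R})}(i) \left( \frac{1}{p^{(\max\mathcal{R})}(i,r)} - 1 \right) .$$

The last term is nonpositive, which gives the claim. □

In particular we get
$\Sigma V[a^{(L_1\mathcal{R})}] \le \Sigma V[a^{(\min\mathcal{R})}] + \Sigma V[a^{(\max\mathcal{R})}]$.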
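The identity in the proof of Lemma 8.6 can be verified by enumerating the three outcomes for a single key. The sketch below assumes, as in the proof, that $a^{(L_1\mathcal{R})} = a^{(\max\mathcal{R})} - a^{(\min\mathcal{R})}$ and that the key is sampled for min only if it is sampled for max; the function names and the numeric values are ours.

```python
def l1_moments(w_min, w_max, p_min, p_max):
    """Exact mean and variance of a_L1 = a_max - a_min within a fixed subspace.

    Outcomes: sampled for both min and max (probability p_min),
    sampled for max only (p_max - p_min), not sampled (1 - p_max)."""
    a_max = w_max / p_max
    a_min = w_min / p_min
    outcomes = [(p_min, a_max - a_min), (p_max - p_min, a_max), (1.0 - p_max, 0.0)]
    mean = sum(prob * a for prob, a in outcomes)
    var = sum(prob * a * a for prob, a in outcomes) - mean ** 2
    return mean, var

def ht_variance(w, p):
    """w^2 * (1/p - 1), as in Equation (13)."""
    return w * w * (1.0 / p - 1.0)

# Hypothetical values with p_min <= p_max and w_min <= w_max.
w_min, w_max, p_min, p_max = 2.0, 5.0, 0.25, 0.5
mean, var = l1_moments(w_min, w_max, p_min, p_max)
closed_form = (ht_variance(w_max, p_max) + ht_variance(w_min, p_min)
               - 2.0 * w_max * w_min * (1.0 / p_max - 1.0))
bound = ht_variance(w_max, p_max) + ht_variance(w_min, p_min)
print(mean, var, closed_form, bound)  # 3.0 17.0 17.0 37.0
```

Here the estimator is unbiased for $w^{(\max\mathcal{R})} - w^{(\min\mathcal{R})} = 3$, and its variance of 17 sits well below the bound of 37 because of the negative cross term.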

9. EVALUATION 

We evaluate the performance of our estimators on summaries, of 
independent and coordinated sketches, produced for the colocated 
and the dispersed data models. 



9.1 Datasets 

• IP dataset1: A set of about $9.2 \times 10^6$ IP packets from a
gateway router. For each packet we have source and destination IP
addresses (srcIP and destIP), source and destination ports (srcPort
and destPort), protocol, and total bytes.

o Colocated data: Packets were aggregated by each of the following
keys.

keys: 4tuples (srcIP, destIP, srcPort, and destPort) ($1.09 \times 10^6$
distinct keys). Weight assignments: number of bytes ($4.25 \times 10^9$
total), number of packets ($9.2 \times 10^6$ total), and 4tuples (uniform,
$1.09 \times 10^6$ total).

keys: destIP ($3.76 \times 10^4$ unique destinations). Weight assignments:
number of bytes ($4.25 \times 10^9$ total), number of packets
($9.2 \times 10^6$ total), number of distinct 4tuples ($1.09 \times 10^6$
total), and destIPs (uniform, $3.76 \times 10^4$ total).

o Dispersed data: We partitioned the packet stream into two
consecutive sets with the same number of packets ($4.6 \times 10^6$) in
each. We refer to the first set as period1 and to the second set as
period2. For each period, packets were aggregated by keys. As keys we
used the destIP or a pair consisting of both the srcIP and the destIP.
We considered three attributes for each key, namely, total number of
bytes, number of packets, or the number of distinct 4tuples with
that key. For each attribute we got two weight assignments $w^{(1)}$
and $w^{(2)}$, one for each period. (See Table 1.)

• IP dataset2: IP packet trace from an IP router during August 1,
2008. Packets were partitioned into one-hour time periods.

o Colocated data: keys: destIP or 4tuples. Weight assignments:
bytes, packets, IP flows, and uniform (distinct key count).

We used the packet stream of Hour3, which has $1.73 \times 10^4$ distinct
destIPs, $1.87 \times 10^{10}$ total bytes, $4.93 \times 10^7$ packets,
$1.30 \times 10^6$ distinct flows, and $0.94 \times 10^6$ distinct 4tuples.

o Dispersed data: The packets in each hour were aggregated into
different weight assignments. We used keys that are destIP or 4tuples
and weights that are the corresponding bytes. We thus obtained a
weight assignment $w^{(b)}$ for each hour.

The following table summarizes some properties of the data for
the first 4 hours and for the sets of hours $\mathcal{R} = \{1,2\}$ and
$\mathcal{R} = \{1,2,3,4\}$. The table lists the number of distinct keys
(destIP or 4tuples) and total bytes $\sum_i w^{(b)}(i)$ for each hour or
set of hours.



hours                        1      2      3      4      {1,2}   {1,2,3,4}
destIP ($\times 10^4$)       2.17   2.96   1.73   1.76   3.41    3.61
4tuples ($\times 10^6$)      1.05   1.17   0.94   0.99   2.10    3.74
bytes ($\times 10^{10}$)     2.00   1.84   1.87   1.81   3.84    7.52



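The following table lists, for destIP and 4tuple keys, the sums
$\sum_i w^{(\min\mathcal{R})}(i)$, $\sum_i w^{(\max\mathcal{R})}(i)$, and
$\sum_i w^{(L_1\mathcal{R})}(i)$ for $\mathcal{R} = \{1,2\}$ and
$\mathcal{R} = \{1,2,3,4\}$.

key                          destIP                  4tuple
$\mathcal{R}$                {1,2}    {1,2,3,4}      {1,2}    {1,2,3,4}
min ($\times 10^{10}$)       1.51     1.33           0.86     0.82
max ($\times 10^{10}$)       2.33     3.02           2.99     4.92
$L_1$ ($\times 10^{10}$)     0.83     1.69           2.13     4.11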



• Netflix Prize Data [44]. The dataset contains dated ratings of
$1.77 \times 10^4$ movies. We used all 2005 ratings ($5.33 \times 10^7$). Each
key corresponds to a movie and we used 12 weight assignments
$b \in \{1 \ldots 12\}$ that correspond to months. The weight $w^{(b)}(i)$ is
the number of ratings of movie $i$ in month $b$. (See Table 2 for more
details.)

• Stocks data: Data set contains daily data for about 8.9K ticker
symbols, for October, 2008 (23 trading days). Daily data of each
ticker had 5 price attributes (open, high, low, close, adjusted_close)
and volume traded. Table 3 lists totals of these weights for each
trading day.

Figure 3: Top: IP dataset1 (left), IP dataset2 (middle), Netflix data set (right). Bottom: stocks dataset high-values (left), stocks
dataset volume values (middle). Ratio of $\Sigma V[a^{(\min\mathcal{R})}]$ for independent sketches to $\Sigma V[a^{(\min\mathcal{R})}]$ for coordinated sketches.

The ticker prices are highly correlated both in terms of the same
attribute over different days and the different price attributes in a
given day. The correlation is much stronger than for the volume
attribute or the weight assignments used in the IP datasets. At least
93% of stocks had positive volume each day and virtually all had
positive (high, low, close, adjusted_close) prices for the duration.
This contrasts with the IP datasets, where it is much more likely for
keys (destIPs or 4tuples) to have zero weights in subsequent
assignments.

o Colocated data: keys: ticker symbols; weight assignments: the
six numeric attributes: open, high, low, close, adjusted_close, and
volume in a given trading day.

o Dispersed data: keys: ticker symbols; weight assignments:
daily (high or volume) values for multiple trading days.

For multiple-assignment aggregates evaluation, we used the first
2, 5, 10, 15, and 23 trading days of October: $\mathcal{R} = \{1,2\}$
(October 1-2), $\mathcal{R} = \{1, \ldots, 5\}$ (October 1-7),
$\mathcal{R} = \{1, \ldots, 10\}$ (October 1-14), $\mathcal{R} = \{1, \ldots, 15\}$
(October 1-21), $\mathcal{R} = \{1, \ldots, 23\}$ (October 1-31). The following
table lists $\sum_i w^{(\min\mathcal{R})}(i)$, $\sum_i w^{(\max\mathcal{R})}(i)$, and
$\sum_i w^{(L_1\mathcal{R})}(i)$ for these sets of trading days.

               high ($\times 10^5$)                  volume ($\times 10^{10}$)
trading days   1-2    1-5    1-10   1-15   1-23      1-2    1-5    1-10   1-15   1-23
min            1.82   1.67   1.48   1.44   1.33      1.34   1.33   1.30   1.15   1.13
max            1.87   1.89   1.92   1.92   1.94      1.80   2.54   3.50   3.59   3.77
$L_1$          0.05   0.22   0.44   0.49   0.61      0.41   1.20   2.20   2.43   2.64



9.2 Dispersed data

We evaluate our $w^{(\max\mathcal{R})}$, $w^{(\min\mathcal{R})}$, and
$w^{(L_1\mathcal{R})}$ estimators as defined in Section 7:
$a^{(\max\mathcal{R})}$, $a_s^{(\min\mathcal{R})}$, $a_l^{(\min\mathcal{R})}$,
$a_s^{(L_1\mathcal{R})}$, and $a_l^{(L_1\mathcal{R})}$ for coordinated sketches
and $a_{ind}^{(\min\mathcal{R})}$ for independent sketches.

We used shared-seed coordinated sketches and show results for
the IPPS ranks (see Section 3). Results for EXP ranks were similar.

We measure performance using the absolute $\Sigma V[a^{(f)}]$ and
normalized $n\Sigma V[a^{(f)}] = \Sigma V[a^{(f)}] / (\sum_{i \in I} f(i))^2$
sums of per-key variances (as discussed in Section 3), which we
approximate by averaging square errors over multiple (25-200) runs of
the sampling algorithm.
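The measurement procedure just described can be sketched as follows. This is a minimal illustration under our own assumptions, not the authors' evaluation code: each run is represented as a dict from sampled keys to adjusted weights, keys absent from a run have adjusted weight 0, and the names `estimate_sum_of_variances` and `normalize` are hypothetical.

```python
def estimate_sum_of_variances(runs, f):
    """Average over runs of the sum over keys of (adjusted weight - f(i))^2.

    runs: list of dicts mapping sampled keys to adjusted weights;
    keys missing from a run are treated as having adjusted weight 0."""
    total = 0.0
    for adjusted in runs:
        total += sum((adjusted.get(i, 0.0) - f[i]) ** 2 for i in f)
    return total / len(runs)

def normalize(sum_of_variances, f):
    """Divide by the squared total of f to get the normalized measure."""
    return sum_of_variances / sum(f.values()) ** 2

# Toy data: two keys, each kept independently with probability 1/2, so the
# four equally likely outcomes are listed once each instead of being sampled.
f = {"x": 1.0, "y": 3.0}
runs = [{"x": 2.0}, {"y": 6.0}, {"x": 2.0, "y": 6.0}, {}]
sv = estimate_sum_of_variances(runs, f)
print(sv, normalize(sv, f))  # 10.0 0.625
```

Because the estimators are unbiased, the mean squared error per key equals its variance, which is why averaging squared errors over runs approximates $\Sigma V$.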

Coordinated versus independent sketches. We compare the
$w^{(\min\mathcal{R})}$ estimators $a_l^{(\min\mathcal{R})}$ (coordinated sketches)
and $a_{ind}^{(\min\mathcal{R})}$ (independent sketches).

Figure 3 shows the ratio
$\Sigma V[a_{ind}^{(\min\mathcal{R})}] / \Sigma V[a_l^{(\min\mathcal{R})}]$ as a
function of $k$ for our datasets. Across data sets, the variance of the
independent-sketches estimator is significantly larger, up to many
orders of magnitude, than the variance of the coordinated-sketches
estimators. The ratio decreases with $k$ but remains significant even
when the sample size exceeds 10% of the number of keys.

The ratio increases with the number of weight assignments: On 
the Netflix data set, the ratio is 1-3 orders of magnitude for 2 assign- 
ments (months) and 10-40 orders of magnitude for 6-12 months. 
On IP dataset 2, the gap is 1-5 orders of magnitude for 2 assign- 
ments (hours) and 2-18 orders of magnitude for 4 assignments. On 
the stocks data set, the gap is 1-3 orders of magnitude for 2 assign- 
ments and reaches 150 orders of magnitude. This agrees with the 
exponential decrease of the inclusion probability with the number 
of assignments for independent sketches (see Section IT!2] |. These 
ratios demonstrate the estimation power provided by coordination. 

Weighted versus unweighted coordinated sketches. We compare
the performance of our estimators to known estimators applicable
to unweighted coordinated sketches (coordinated sketches for
uniform and global weights [18]). To apply these methods, all
positive weights were replaced by unit weights. Because of the
skewed nature of the weight distribution, the "unweighted"
estimators performed poorly, with variance being orders of magnitude
larger (plots are omitted).

Variance of multiple-assignment estimators. We relate the
variance of our $w^{(\min\mathcal{R})}$, $w^{(\max\mathcal{R})}$, and
$w^{(L_1\mathcal{R})}$ estimators and the variance of the optimal
single-assignment estimators $a^{(b)}$ for the respective individual
weight assignments $w^{(b)}$ ($b \in \mathcal{R}$). Because the variance of
$a_{ind}^{(\min\mathcal{R})}$ was typically many orders of magnitude worse,
we include it only when it fit in the scale of the plot. The
single-assignment estimators $a^{(b)}$ are identical for independent and
coordinated sketches (constructed with the same $k$ and rank functions
family), and hence are shown once.

Since there are no well-behaved $w^{(\max\mathcal{R})}$ and $w^{(L_1\mathcal{R})}$ estimators
for independent sketches, we only consider the $w^{(\min\mathcal{R})}$ estimator.

For IPPS ranks, $a^{(b)}$ are essentially optimal as they minimize



T.V\a 



(maXK)-| 



and 



to the combined sample size, which is the number of distinct keys 
in the combined sample. We therefore use the notation a'^\ for 

the plain estimator applied to independent sketches and a^'^c for the 
plain estimator applied to coordinated sketches. 

We compare summaries (coordinated and independent) and esti- 
mators (inclusive and plain) based on the tradeoff of variance ver- 
sus summary size (number of distinct keys). Figures [T2l [TB] [141 
and [15] show the normalized sums of variances, for inclusive and 



Figures g] [S] |6] and |7] show, EV[a, 
Sl/[a[^i'^^] and EFfaC"'] for 6 G 71 are within an order of mag- 
nitude. On our datasets, nE1^[a'''^] and nYjV[af^^^''^^\ are clus- 

tered together with knSV < 1 (and decreases with k) (theory plain estimators nSV[af'], nEl/[a^°'], nEl/p.^a^"'], nEl/[a^°^] 

as a function of the combined sample size. For a fixed sketch size, 
plain estimators perform worse for independent sketches than for 
coordinated sketches. This happens since an independent sketch of 
some fixed size contains a smaller sketch for each weight assign- 
ment than a coordinated sketch of the same size. In other words the 
"k" which we use to get an independent sketch of some fixed size 
is smaller than the "k" which we use to get a coordinated sketch of 
the same size. Inclusive estimators for independent and coordinated 
sketches of the same size had similar variance. (Note however that 



says (fc - 2)nT.V < 1.) We also observed that nT.V[a\^^'^\ and 
nEV[a;™'"^'] are typically close to nTjV[a^''^. We observe the 



empirical relations Syfa^™"^'"^''] < Syfay""^'^'] (with larger gap 
when the Li difference is very small), Sy[a^^i'^'] < Sy[a^™'"''^'], 
and Sy[a^"''"'^^] < minj,g7j T.V[a'-'''i]. Empirically, the variance of 
our multi-assignment estimators with respect to single-assignment 
weights is significantly lower than the worst-case analytic bounds 
in Section[8l(Lemma l8.5l and l8.6t . For normalized (relative) vari- 
ances, we observe the "reversed" relations nSy[a^™'"'^^] > nT.V[af^'^^'^\)r a given union size, we get weaker confidence bounds with in- 



nT.V[a 



> nSVfa 



and nSVfa 



{miiiTj), 



> maxjjgTj nEV[a('')]'^^P^'^'^^'^'- samples than with coordinated samples, simply because 



which are explained by smaller normalization factors for ui'™'"'^^ 
and w'^i'^'. 

S-set versus L-set estimators. Figure 8 quantifies the advantage
of the stronger l-set estimators over the s-set estimators for
coordinated sketches. The advantage varies widely between datasets:
15%-80% for the Netflix dataset, 0%-9% for IP dataset1, 0%-20%
for IP dataset2, and 0%-300% on the Stocks dataset.

9.3 Colocated data

We computed shared-seed coordinated and independent sketches
and show results for IPPS ranks (see Section 3). Results for EXP
ranks were similar.



We consider the following $w^{(b)}$-weights estimators. $a_c^{(b)}$: the
shared-seed coordinated sketches inclusive estimator (Section 6).
$a_i^{(b)}$: the independent sketches inclusive estimator (Section 6,
Eq. 4). $a_p^{(b)}$: the plain order-$k$ sketch RC estimator ([26] for
IPPS ranks). Among all keys of the combined sketch, this estimator
uses only the keys which are part of the order-$k$ sketch of $b$.

We study the benefit of our inclusive estimators by comparing
them to plain estimators. Since plain estimators cannot be used
effectively for multiple-assignment aggregates, we focus on
(single-assignment) weights.

Inclusive versus plain estimators. The plain estimators we
used are optimal for individual order-$k$ sketches, and the benefit of
inclusive estimators comes from utilizing keys that were sampled
for "other" weight assignments. We computed the ratios

$$\Sigma V[a_c^{(b)}] / \Sigma V[a_p^{(b)}] \quad \text{and} \quad \Sigma V[a_i^{(b)}] / \Sigma V[a_p^{(b)}]$$

as a function of $k$. As Figures 9, 10, and 11 show, the ratios vary
between 0.05 and 0.9 on our datasets and show a significant benefit
for inclusive estimators. Our inclusive estimators are considerably
more accurate with both coordinated and independent sketches.
With independent sketches the benefit of the inclusive estimators is
larger than with coordinated sketches, since the independent sketches
contain many more distinct keys for a given $k$.

Variance versus storage. For a fixed $k$, the plain estimator is in
fact identical for independent and coordinated order-$k$ sketches.
Independent order-$k$ sketches, however, tend to be larger than
coordinated order-$k$ sketches. Here we compare the performance relative



T.V[a^''>] (and nT,V[a^''>]) modulo a single sample (26l[54). 



we are guaranteed fewer samples with respect to each particular 
assignment.) 

Sharing ratio. The sharing ratio, $|S| / (k \cdot |\mathcal{W}|)$, of a
colocated summary $S$ is the ratio of the expected number of distinct
keys in $S$ and the product of $k$ and the number of weight assignments
$|\mathcal{W}|$. The sharing ratio measures the combined sketch size needed
so that we include an order-$k$ sketch for all weight assignments.
Figure 17 shows the sharing ratio for coordinated and independent
order-$k$ sketches, as a function of $k$. Coordinated sketches minimize
the sharing ratio (Lemma 4.2). On our datasets, the ratio varies
between 0.25-0.68 for coordinated sketches and 0.4-1 for independent
sketches. The sharing ratio decreases when $k$ becomes a larger
fraction of keys, both for independent and coordinated sketches,
simply because it is more likely that a key is included in a sample of
another assignment. For independent sketches, the sharing ratio is
above 0.85 for smaller values of $k$ and can be considerably higher
than with coordinated sketches. Coordinated sketches have a lower
(better) sharing ratio when weight assignments are more correlated.
The sharing ratio is at least $1/|\mathcal{W}|$ (this is achieved for
coordinated sketches when assignments are identical).
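The sharing ratio can be illustrated with a small simulation. The sketch below is ours and makes several assumptions: IPPS ranks $u/w$, an order-$k$ sketch taken as the $k$ smallest-ranked keys with positive weight, one shared uniform seed per key for coordinated sketches and a fresh seed per key and assignment for independent ones. It computes the ratio for a single random draw, whereas the definition above uses the expected summary size.

```python
import random

def bottom_k(ranks, k):
    """Set of the k keys with the smallest ranks."""
    return set(sorted(ranks, key=ranks.get)[:k])

def combined_summary(weights, k, coordinated, rng):
    """Union of the order-k sketches of all assignments.

    weights: list of dicts (key -> weight), one dict per weight assignment."""
    keys = sorted(set().union(*weights))
    shared_seed = {i: rng.random() for i in keys}
    summary = set()
    for w in weights:
        ranks = {}
        for i, w_i in w.items():
            if w_i > 0:
                u = shared_seed[i] if coordinated else rng.random()
                ranks[i] = u / w_i          # IPPS rank
        summary |= bottom_k(ranks, k)
    return summary

def sharing_ratio(weights, k, coordinated, rng):
    """|S| / (k * number of assignments) for one random draw."""
    return len(combined_summary(weights, k, coordinated, rng)) / (k * len(weights))

# Hypothetical data: 50 keys and three identical weight assignments.
rng = random.Random(7)
base = {f"key{j}": float(j + 1) for j in range(50)}
identical = [dict(base), dict(base), dict(base)]
print(sharing_ratio(identical, 5, True, rng))   # coordinated
print(sharing_ratio(identical, 5, False, rng))  # independent
```

With identical assignments the coordinated sketches coincide, so the summary holds exactly $k$ keys and the ratio reaches its lower limit $1/|\mathcal{W}|$; independent sketches of the same data generally overlap less.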

10. CONCLUSION 

We motivate and study the problem of summarizing data sets
modeled as keys with vector weights. We identify two models
for these data sets, dispersed (such as measurements from different
times or locations) and colocated (records with multiple numeric
attributes), that differ in the constraints they impose on scalable
summarization. We then develop a sampling framework and accurate
estimators for common aggregates, including aggregations
over subpopulations that are specified a posteriori.

Our estimators over coordinated weighted samples for
single-assignment and multiple-assignment aggregates, including
weighted sums and the $L_1$ difference, max, and min, improve over
previous methods by orders of magnitude. Previous methods include
independent weighted samples from each assignment, which poorly
support multiple-assignment aggregates, and uniform coordinated
samples, which perform poorly when, as is often the case, weight
values are skewed. For colocated data sets, our coordinated weighted
samples achieve optimal summary size while guaranteeing embedded
weighted samples of certain sizes with respect to each individual
assignment. We derive estimators for single-assignment and
multiple-assignment aggregates over both independent and coordinated
samples that are significantly tighter than existing ones.

Figure 4: IP dataset1. Sum of square errors, left: absolute, right: normalized. Top: key=destIP, weight=4tuple_count. Second row:
key=destIP, weight=bytes. Third row: key=srcIP+destIP, weight=packets. Fourth row: key=srcIP+destIP, weight=bytes.

Figure 17: Sharing ratio of coordinated and independent sketches. Left: IP dataset1, key=destIP (4 weight assignments: bytes,
packets, 4tuples, IPdests). Middle: IP dataset1, key=4tuple (3 weight assignments: bytes, packets, 4tuples). Right: Stocks dataset (6
weight assignments). Bottom: IP dataset2, key=destIP (left); IP dataset2, key=4tuple (middle).

As part of ongoing work, we are applying our sampling and es- 
timation framework to the challenging problem of detection of net- 
work problems. We are also exploring the system aspects of de- 
ploying our approach within the network monitoring infrastructure 
in a large ISP. 